Abstract

Performance bounds are given for exploratory co-clustering/blockmodeling of bipartite
graph data, where we assume the rows and columns of the data matrix are samples
from an arbitrary population. This is equivalent to assuming that the data is generated
from a nonsmooth graphon. It is shown that co-clusters found by any method can be
extended to the row and column populations, or equivalently that the estimated blockmodel
approximates a blocked version of the generative graphon, with estimation error bounded
by $O_P(n^{-1/2})$. Analogous performance bounds are also given for degree-corrected
blockmodels and random dot product graphs, with error rates depending on the dimensionality
of the latent variable space.


1 Introduction 

In the statistical analysis of network data, blockmodeling (or community detection) and its 
variants are a popular class of methods that have been tried in many applications, such as 
modeling of communication patterns [7], academic citations [17], protein networks [1], online 
behavior [21, 30], and ecological networks [15]. 

In order to develop a theoretical understanding, many recent papers have established
consistency properties for the blockmodel. In these papers, the observed network is assumed to
be generated using a set of latent variables that assign the vertices into groups (the
“communities”), and the inferential goal is to recover the correct group membership from the observed
data. Various conditions have been established under which recovery is possible [5, 6] and also
computationally tractable [10, 11, 20, 24, 28]. Additionally, conditions are also known under
which no algorithm can correctly recover the group memberships [13, 23].

The existence of a true group membership is central to these results. In particular, they 
assume a generative model in which all members of the same group are statistically identical. 
This implies that the group memberships explain the entirety of the network structure. In 
practice, we might not expect this assumption to even approximately hold, and the objective of 
finding “true communities” could be difficult to define precisely, so that a more reasonable goal 
might be to discover group labels which partially explain structure that is evident in the data. 
Comparatively little work has been done to understand blockmodeling from this viewpoint. 
To address this gap, we consider the problem of blockmodeling under model misspecification.
We assume that the data is generated not by a blockmodel, but by a much larger
nonparametric class known as a graphon. This is equivalent to assuming that the vertices are 
sampled from an underlying population, in which no two members are identical and the notion 
of a true community partition need not exist. In this setting, blockmodeling might be better 
understood not as a generative model, but rather as an exploratory method for finding
high-level structure: by dividing the vertices into groups, we divide the network into subgraphs that
can exhibit varying levels of connectivity. This is analogous to the usage of histograms to find 
high and low density regions in a nonparametric distribution. Just as a histogram replicates the 
binned version of its underlying distribution without restrictive assumptions, we will show that 
the blockmodel replicates structure in the underlying population when the observed network is 
generated from an arbitrary graphon. 

Our results are restricted to the case of bipartite graph data. Such data arises naturally in 
many settings, such as customer-product networks where connections may represent purchases, 
reviews, or some other interaction between people and products. 

The organization of the paper is as follows. Related work is discussed in Section 2. In
Section 3, we define the blockmodeling problem for bipartite data generated from a graphon, and
present a result showing that the blockmodel can detect structure in the underlying population.
In Section 4, we discuss extensions of the blockmodel, such as mixed-membership, and give a
result regarding the behavior of the excess risk in such models. Section 5 contains a proof sketch
and the proof of the main theorem. Auxiliary results and extensions are proven in the Appendix.


2 Related Works 

The papers [2, 14, 19, 25], and [12] are most similar to the present work, in that they consider 
the problem of approximating a graphon by a blockmodel. The papers [2, 14, 19] and [25] 
consider both bipartite and non-bipartite graph data, and require the generative graphon to 
satisfy a smoothness condition, with [14] establishing a minimax error rate and [19] extending 
the results to a class of sparse graphon models. In a similar vein, [29] shows consistent and 
computationally efficient estimation assuming a type of low rank generative model. While 
smoothness and rank assumptions are natural for many non-parametric regression problems, it 
seems difficult to judge whether they are appropriate for network data and if they are indeed 
necessary for good performance. 

In [12] and in this present paper, which consider only bipartite graphs, the emphasis is 
on exploratory analysis. Hence no assumptions are placed on the generative graphon. Unlike 
the works which assume smoothness or low rank structure, the object of inference is not the 
generative model itself, but rather a blocked version of it (this is defined precisely in Section 
3). This is reminiscent of some results for confidence intervals in nonparametric regression, 
in which the interval is centered not on the generative function or density itself, but rather on 
a smoothed or histogram-ed version [31, Sec 5.7 and Thm 6.20]. The present paper can be 
viewed as a substantial improvement over [12]; for example, Theorem 1 improves the rates
of convergence from $O_P(n^{-1/4})$ to $O_P(n^{-1/2})$, and also applies to computationally efficient
estimates.


3 Co-clustering of nonsmooth graphons 

In this section, we give a formulation for co-clustering (or co-blockmodeling) in which the rows 
and columns of the data matrix are samples from row and column populations, and correspond 
to the vertices of a bipartite graph. We then present an approximation result which implies that 
any co-clustering of the rows and columns of the data matrix can be extended to the populations. 
Roughly speaking, this means that if a co-clustering “reveals structure” in the data matrix, then 
similar structure will also exist at the population level. 

3.1 Problem Formulation 

Data generating process Let $A \in \{0,1\}^{m \times n}$ denote a binary $m \times n$ matrix representing the
observed data. For example, $A_{ij}$ could denote whether person $i$ rated movie $j$ favorably, or
whether gene $i$ was expressed under condition $j$.

We assume that A is generated by the following model, in which each row and column of 
A is associated with a latent variable that is sampled from a population: 

Definition 1 (Bipartite Graphon). Given $m$ and $n$, let $x_1,\ldots,x_m$ and $y_1,\ldots,y_n$ denote i.i.d.
uniform $[0,1]$ latent variables

$$x_1,\ldots,x_m \sim \mathrm{Unif}[0,1] \quad \text{and} \quad y_1,\ldots,y_n \sim \mathrm{Unif}[0,1].$$

Let $\omega : [0,1]^2 \mapsto [0,1]$ specify the distribution of $A \in \{0,1\}^{m \times n}$, conditioned on the latent
variables $\{x_i\}_{i=1}^m$ and $\{y_j\}_{j=1}^n$,

$$A_{ij} \sim \mathrm{Bernoulli}\left(\omega(x_i,y_j)\right), \qquad i \in [m],\ j \in [n],$$

where the Bernoulli random variables are independent.

We will require that $\omega$ be measurable and bounded between 0 and 1, but it may otherwise be
arbitrarily non-smooth. We will use $\mathcal{X} = [0,1]$ and $\mathcal{Y} = [0,1]$ to denote the populations from
which $\{x_i\}$ and $\{y_j\}$ are sampled.
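
As an illustration of Definition 1, the following is a minimal Python sketch of the sampling scheme, using only the standard library. The function names `sample_bipartite_graphon` and `checkerboard`, the seed, and the particular discontinuous graphon are assumptions made for the example and do not appear in the paper.

```python
import random

def sample_bipartite_graphon(omega, m, n, seed=0):
    """Draw A in {0,1}^{m x n} following Definition 1.

    omega: function (x, y) -> probability in [0, 1].
    Returns (A, x, y), where x and y are the latent uniform variables.
    """
    rng = random.Random(seed)
    x = [rng.random() for _ in range(m)]  # row latent variables
    y = [rng.random() for _ in range(n)]  # column latent variables
    # Given the latent variables, entries are independent Bernoulli draws.
    A = [[1 if rng.random() < omega(xi, yj) else 0 for yj in y] for xi in x]
    return A, x, y

def checkerboard(x, y, cells=8):
    """Hypothetical nonsmooth graphon: a discontinuous checkerboard pattern."""
    return 0.9 if (int(x * cells) + int(y * cells)) % 2 == 0 else 0.1

A, x, y = sample_bipartite_graphon(checkerboard, m=30, n=40)
```

Any bounded measurable function can be passed as `omega`; the checkerboard is used only because it is visibly non-smooth.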

Co-clustering In co-clustering, the rows and columns of a data matrix A are simultaneously 
clustered to reveal submatrices of A that have similar values. When A is binary valued, this is 
also called blockmodeling (or co-blockmodeling). 

Our notation for co-clustering is the following. Let $K$ denote the number of clusters. Let
$S \in [K]^m$ denote a vector identifying the cluster labels corresponding to the $m$ rows of $A$, e.g.,
$S_i = k$ means that the $i$th row is assigned to cluster $k$. Similarly, let $T \in [K]^n$ identify the
cluster labels corresponding to the $n$ columns of $A$. Given $(S, T)$, let $\Phi_A(S, T) \in [0,1]^{K \times K}$ denote
the normalized sums for the submatrices of $A$ induced by $S$ and $T$:

$$[\Phi_A(S,T)]_{st} = \frac{1}{nm} \sum_{i=1}^m \sum_{j=1}^n A_{ij}\, 1(S_i = s, T_j = t), \qquad s,t \in [K].$$
Let $\pi_S \in [0,1]^K$ and $\pi_T \in [0,1]^K$ denote the fraction of rows or columns in each cluster:

$$\pi_S(s) = \frac{1}{m} \sum_{i=1}^m 1(S_i = s) \quad \text{and} \quad \pi_T(t) = \frac{1}{n} \sum_{j=1}^n 1(T_j = t).$$

Let the average value of the $(s,t)$th submatrix be denoted by $\theta_{st}$, given by

$$\theta_{st} = \frac{[\Phi_A(S,T)]_{st}}{\pi_S(s)\,\pi_T(t)}.$$

Generally, $S$ and $T$ are chosen heuristically to make the entries of $\theta$ far from the overall average
of $A$. A common approach is to perform k-means clustering of the spectral coordinates for each
row and column of $A$ [26]. Heterogeneous values of $\theta$ can be interpreted as revealing subgroups
of the rows and columns in $A$.
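
The quantities above can be computed directly from a data matrix and a pair of label vectors. The sketch below is illustrative only: the function names are hypothetical, labels are taken in $\{0,\ldots,K-1\}$ rather than $[K]$ for convenience, and the density of an empty co-cluster is reported as `None` (a choice made here, since the ratio is undefined in that case).

```python
def block_sums(A, S, T, K):
    """Phi_A(S, T): K x K matrix of block sums of A, normalized by n*m."""
    m, n = len(A), len(A[0])
    phi = [[0.0] * K for _ in range(K)]
    for i in range(m):
        for j in range(n):
            phi[S[i]][T[j]] += A[i][j]
    return [[v / (n * m) for v in row] for row in phi]

def cluster_fractions(labels, K):
    """pi_S or pi_T: fraction of rows (or columns) assigned to each cluster."""
    counts = [0] * K
    for k in labels:
        counts[k] += 1
    return [c / len(labels) for c in counts]

def block_densities(A, S, T, K):
    """theta_st = Phi_A(S,T)_st / (pi_S(s) * pi_T(t)); None for empty blocks."""
    phi = block_sums(A, S, T, K)
    pi_S, pi_T = cluster_fractions(S, K), cluster_fractions(T, K)
    return [[phi[s][t] / (pi_S[s] * pi_T[t]) if pi_S[s] * pi_T[t] > 0 else None
             for t in range(K)] for s in range(K)]

# Small worked example: 2 rows, 3 columns, K = 2.
A = [[1, 1, 0],
     [1, 0, 0]]
S, T = [0, 1], [0, 0, 1]
phi = block_sums(A, S, T, K=2)
theta = block_densities(A, S, T, K=2)
```

In this example the block of row cluster 0 and column cluster 0 contains two entries, both equal to 1, so its density is 1, while the block of row cluster 1 and column cluster 0 has density 1/2.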

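Population co-blockmodel Given a co-clustering $(S, T)$ of the rows and columns of $A$, we
will consider whether similar subgroups also exist in the unobserved populations $\mathcal{X}$ and $\mathcal{Y}$.
Let $\sigma : \mathcal{X} \mapsto [K]$ and $\tau : \mathcal{Y} \mapsto [K]$ denote mappings that co-cluster the row and column
populations $\mathcal{X}$ and $\mathcal{Y}$. Let $\Phi_\omega(\sigma, \tau) \in [0,1]^{K \times K}$ denote the integral of $\omega$ within the induced
co-clusters, or the blocked version of $\omega$:

$$[\Phi_\omega(\sigma,\tau)]_{st} = \int_{\mathcal{X} \times \mathcal{Y}} \omega(x,y)\, 1(\sigma(x) = s, \tau(y) = t)\, dx\, dy, \qquad s,t \in [K].$$

Let $\Phi_\omega(S, \tau) \in [0,1]^{K \times K}$ denote the integral of $\omega$ within the induced co-clusters, over
$\{x_1,\ldots,x_m\} \times \mathcal{Y}$:

$$[\Phi_\omega(S,\tau)]_{st} = \frac{1}{m} \sum_{i=1}^m \int_{\mathcal{Y}} \omega(x_i,y)\, 1(S_i = s, \tau(y) = t)\, dy.$$

Let $\pi_\sigma$ and $\pi_\tau$ denote the fraction of the population in each cluster:

$$\pi_\sigma(s) = \int_{\mathcal{X}} 1(\sigma(x) = s)\, dx \quad \text{and} \quad \pi_\tau(t) = \int_{\mathcal{Y}} 1(\tau(y) = t)\, dy.$$

Theorem 1 will show that for each clustering $S, T$, there exists $\sigma : \mathcal{X} \mapsto [K]$ and $\tau : \mathcal{Y} \mapsto [K]$
which cluster the populations $\mathcal{X}$ and $\mathcal{Y}$ such that $\Phi_A(S,T) \approx \Phi_\omega(S,\tau)$ and
$\Phi_A(S,T) \approx \Phi_\omega(\sigma,\tau)$, as well as $\pi_S \approx \pi_\sigma$ and $\pi_T \approx \pi_\tau$, implying that subgroups found by co-clustering $A$
are indicative of similar structure in the populations $\mathcal{X}$ and $\mathcal{Y}$.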
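
To make the blocked graphon concrete, the following sketch approximates $[\Phi_\omega(\sigma,\tau)]_{st}$ numerically with a midpoint rule on a uniform grid. This is an illustrative approximation and not part of the paper's argument: the function name `blocked_graphon`, the grid size, and the test graphon $\omega(x,y) = xy$ are assumptions, and for a general nonsmooth $\omega$ the midpoint rule gives only an approximation of the integral.

```python
def blocked_graphon(omega, sigma, tau, K, grid=200):
    """Midpoint-rule approximation of Phi_omega(sigma, tau) on [0,1]^2.

    sigma, tau: functions from [0, 1] to cluster labels in {0, ..., K-1}.
    """
    h = 1.0 / grid
    phi = [[0.0] * K for _ in range(K)]
    for a in range(grid):
        x = (a + 0.5) * h
        s = sigma(x)
        for b in range(grid):
            y = (b + 0.5) * h
            phi[s][tau(y)] += omega(x, y) * h * h
    return phi

def halves(z):
    """Hypothetical population clustering: split [0, 1] at 1/2."""
    return 0 if z < 0.5 else 1

phi_omega = blocked_graphon(lambda x, y: x * y, halves, halves, K=2)
```

For this choice the integrals are available in closed form, for instance $[\Phi_\omega]_{00} = (1/8)^2 = 1/64$ and $[\Phi_\omega]_{11} = (3/8)^2 = 9/64$, which gives a simple check of the routine.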

3.2 Approximation Result for Co-clustering 

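Theorem 1 states that for each $(S, T) \in [K]^m \times [K]^n$, there exists population co-clusters
$\sigma_S : \mathcal{X} \mapsto [K]$ and $\tau_T : \mathcal{Y} \mapsto [K]$ such that $\Phi_A(S,T) \approx \Phi_\omega(S,\tau_T) \approx \Phi_\omega(\sigma_S,\tau_T)$, and also
$\pi_S \approx \pi_{\sigma_S}$ and $\pi_T \approx \pi_{\tau_T}$.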
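Theorem 1. Let $A \in \{0,1\}^{m \times n}$ be generated by some $\omega$ according to Definition 1, with fixed
ratio $m/n$. Let $(S, T)$ denote vectors in $[K]^m$ and $[K]^n$ respectively, with $K \le n^{1/2}$.

1. For each $T \in [K]^n$, there exists $\tau_T : \mathcal{Y} \mapsto [K]$ such that

$$\max_{S,T \in [K]^m \times [K]^n} \|\Phi_A(S,T) - \Phi_\omega(S,\tau_T)\| + \|\pi_T - \pi_{\tau_T}\| = O_P\left(\sqrt{\frac{K^2 \log n}{n}}\right). \qquad (1)$$

2. For each $S \in [K]^m$, there exists $\sigma_S : \mathcal{X} \mapsto [K]$, such that

$$\sup_{S,\tau \in [K]^m \times [K]^{\mathcal{Y}}} \|\Phi_\omega(S,\tau) - \Phi_\omega(\sigma_S,\tau)\| + \|\pi_S - \pi_{\sigma_S}\| = O_P\left(\sqrt{\frac{K^2 \log m}{m}}\right). \qquad (2)$$

3. Combining (1) and (2) yields

$$\max_{S,T \in [K]^m \times [K]^n} \|\Phi_\omega(\sigma_S,\tau_T) - \Phi_A(S,T)\| + \|\pi_T - \pi_{\tau_T}\| + \|\pi_S - \pi_{\sigma_S}\| = O_P\left(\sqrt{\frac{K^2 \log n}{n}}\right). \qquad (3)$$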

Remarks for Theorem 1 To give context to Theorem 1, suppose that $A \in \{0,1\}^{m \times n}$
represents product-customer interactions, where $A_{ij} = 1$ indicates that product $i$ was purchased
(or viewed, reviewed, etc.) by customer $j$. We assume $A$ is generated by Definition 1,
meaning that the products and customers are samples from populations. This could be literally true
if $A$ is sampled from a larger data set, or the populations might only be conceptual, perhaps
representing future products and potential customers.

Suppose that we have discovered cluster labels $S \in [K]^m$ and $T \in [K]^n$ producing a density
matrix $\theta$ with heterogeneous values. These clusters can be interpreted as product categories
and customer subgroups, with heterogeneity in $\theta$ indicating that each customer subgroup may
prefer certain product categories over others. We are interested in the following question: will
this pattern generalize to the populations $\mathcal{X}$ and $\mathcal{Y}$? Or is it descriptive, holding only for the
particular customers and products that are in the data matrix $A$?

An answer is given by Theorem 1. Specifically, (1) and (3) show different senses in which
the co-clustering $(S, T)$ may generalize to the underlying populations. (1) implies that the
customer population $\mathcal{Y}$ will be similar to the $n$ observed customers in the data, regarding their
purchases of the $m$ observed products when aggregated by product category. (3) implies a
similar result, but for their purchases of the entire population $\mathcal{X}$ of products aggregated by
product category, as opposed only to the $m$ observed products in the data.

Since Theorem 1 holds for all $(S, T)$, it applies regardless of the algorithm that is used to
choose the co-blockmodel. It also applies to nested or hierarchical clusters. If (1) or (3) holds
at the lowest level of hierarchy with $K$ classes, then it also holds for the aggregated values at
higher levels as well, albeit with the error term increased by a factor which is at most $K$.



Theorem 1 controls the behavior of $\Phi_A$, $\pi_S$, and $\pi_T$, instead of the density matrix $\theta$ which
may be of interest. However, since $\theta$ is derived from the previous quantities, it follows that
Theorem 1 also implies control of $\theta$ for all co-clusters involving $m^{1/2}$ rows or $n^{1/2}$
columns.

All constants hidden by the $O_P(\cdot)$ notation in Theorem 1 are universal, in that they do not
depend on $\omega$ (but do depend on the ratio $m/n$).

4 Application of Theorem 1 to Bipartite Graph Models 

In many existing models for bipartite graphs, the rows and columns of the adjacency matrix
$A \in \{0,1\}^{m \times n}$ are associated with latent variables that are not in $\mathcal{X}$ and $\mathcal{Y}$, but in other spaces $\mathcal{S}$
and $\mathcal{T}$ instead. In this section, we give examples of such models and discuss their estimation by
minimizing empirical squared error. We define the population risk as the difference between the
estimated and actual models, under a transformation mapping $\mathcal{X}$ to $\mathcal{S}$ and $\mathcal{Y}$ to $\mathcal{T}$. Theorem 2
shows that the empirical error surface converges uniformly to the population risk. The theorem
does not assume a correctly specified model, but rather that the data is generated by an arbitrary
$\omega$ following Definition 1.

4.1 Examples of Bipartite Graph Models 

We consider models in which the rows and columns of $A$ are associated with latent variables
that take values in spaces other than $\mathcal{X}$ and $\mathcal{Y}$. To describe these models, we will use
$S = (S_1,\ldots,S_m)$ and $T = (T_1,\ldots,T_n)$ to denote the row and column latent variables, and $\mathcal{S}$
and $\mathcal{T}$ to denote their allowable values. Let $\Theta$ denote a parameter space. Given $\theta \in \Theta$, let
$\omega_\theta : \mathcal{S} \times \mathcal{T} \mapsto [0,1]$ determine the distribution of $A$ conditioned on $(S, T)$, so that the entries
$\{A_{ij}\}$ are conditionally independent Bernoulli variables, with $P(A_{ij} = 1 \mid S, T) = \omega_\theta(S_i, T_j)$.

1. Stochastic co-blockmodel with $K$ classes: Let $\mathcal{S} = \mathcal{T} = [K]$ and $\Theta = [0,1]^{K \times K}$. For
$\theta \in \Theta$, let $\omega_\theta$ be given by

$$\omega_\theta(s,t) = \theta_{st}, \qquad (s,t) \in \mathcal{S} \times \mathcal{T},$$

where $s \in \mathcal{S}$ and $t \in \mathcal{T}$ are row and column co-cluster labels.

2. Degree-corrected co-blockmodel [18, 32]: Let $\mathcal{S} = \mathcal{T} = [K] \times [0,1)$ and
$\Theta = [0,1]^{K \times K}$. Given $u, v \in [K]$ and $b, d \in [0,1)$, let $s = (u, b)$ and $t = (v, d)$. Let $\omega_\theta$
be given by

$$\omega_\theta(s,t) = b\, d\, \theta_{uv}, \qquad (s,t) \in \mathcal{S} \times \mathcal{T}.$$

In this model, $u, v \in [K]$ are co-cluster labels, and $b, d \in [0,1)$ are degree parameters,
allowing for degree heterogeneity within co-clusters.

3. Random Dot Product [16, 29]: Let $\mathcal{S} = \mathcal{T} = \{c \in [0,1)^d : \|c\| \le 1\}$. Let $\omega$ be given
by

$$\omega(s,t) = s^T t, \qquad (s,t) \in \mathcal{S} \times \mathcal{T}.$$
4. Dot Product + Blockmodel: Models 1-3 are instances of a somewhat more general
model. Let $\mathcal{D} = \{c \in [0,1)^d : \|c\| \le 1\}$. Let $\mathcal{S} = \mathcal{T} = [K] \times \mathcal{D}$ and $\Theta = [0,1]^{K \times K}$.
Given $u, v \in [K]$ and $b, d \in \mathcal{D}$, let $s = (u, b)$ and $t = (v, d)$. Let $\omega_\theta$ be given by

$$\omega_\theta(s,t) = b^T d\, \theta_{uv}. \qquad (4)$$
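
The four model specifications can be written as short functions, which may help in comparing them. The sketch below is a hypothetical illustration: the function names are not from the paper, cluster labels are indexed from 0, and vectors in the latent space are represented as Python tuples.

```python
def omega_blockmodel(theta, s, t):
    """Model 1: s and t are co-cluster labels."""
    return theta[s][t]

def omega_degree_corrected(theta, s, t):
    """Model 2: s = (u, b), t = (v, d) with scalar degree parameters b, d."""
    (u, b), (v, d) = s, t
    return b * d * theta[u][v]

def omega_dot_product(s, t):
    """Model 3: s and t are vectors; the edge probability is s^T t."""
    return sum(si * ti for si, ti in zip(s, t))

def omega_dot_block(theta, s, t):
    """Model 4, equation (4): s = (u, b), t = (v, d) with vector-valued b, d."""
    (u, b), (v, d) = s, t
    return omega_dot_product(b, d) * theta[u][v]

theta = [[0.8, 0.2],
         [0.2, 0.6]]
p_block = omega_blockmodel(theta, 0, 1)
p_degree = omega_degree_corrected(theta, (0, 0.5), (1, 0.5))
p_general = omega_dot_block(theta, (0, (0.5,)), (1, (0.5,)))
```

With one-dimensional vectors, the general form (4) reproduces the degree-corrected model, as the last two lines illustrate.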


4.2 Empirical and Population Risk 

Given a data matrix $A \in \{0,1\}^{m \times n}$ and a model specification $(\mathcal{S}, \mathcal{T}, \Theta)$, one method for
estimating $(S, T, \theta) \in \mathcal{S}^m \times \mathcal{T}^n \times \Theta$ is to minimize the empirical squared error $R_A$, given by

$$R_A(S,T;\theta) = \frac{1}{nm} \sum_{i=1}^m \sum_{j=1}^n \left(A_{ij} - \omega_\theta(S_i,T_j)\right)^2.$$

Generally, the global minimum of $R_A$ will be intractable to compute, so a local minimum is
used for the estimate instead.
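
For concreteness, the empirical squared error can be evaluated as follows. This is a minimal sketch under the assumption that the model is passed as a function `omega_theta(s, t)` with the parameter already bound; the function name `empirical_risk` and the toy data are illustrative only.

```python
def empirical_risk(A, S, T, omega_theta):
    """R_A(S, T; theta): mean squared error between A and omega_theta(S_i, T_j)."""
    m, n = len(A), len(A[0])
    total = 0.0
    for i in range(m):
        for j in range(n):
            total += (A[i][j] - omega_theta(S[i], T[j])) ** 2
    return total / (n * m)

# Toy example with the stochastic co-blockmodel (labels indexed from 0).
A = [[1, 0],
     [0, 1]]
S, T = [0, 1], [0, 1]
theta_fit = [[1.0, 0.0], [0.0, 1.0]]
theta_flat = [[0.5, 0.5], [0.5, 0.5]]
risk_fit = empirical_risk(A, S, T, lambda s, t: theta_fit[s][t])
risk_flat = empirical_risk(A, S, T, lambda s, t: theta_flat[s][t])
```

Here the block structure in `theta_fit` matches the data exactly, while the constant model `theta_flat` incurs an error of 1/4 on every entry.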

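If a model $(S, T, \theta)$ is found by minimizing or exploring the empirical risk surface $R_A$, does
it approximate the generative $\omega$? We will define the population risk in two different ways:

1. Approximation of $\omega$ by $\omega_\theta$: Let $\sigma$ and $\tau$ denote mappings $\mathcal{X} \mapsto \mathcal{S}$ and $\mathcal{Y} \mapsto \mathcal{T}$, and let
$R_\omega$ be given by

$$R_\omega(\sigma,\tau;\theta) = \int_{\mathcal{X} \times \mathcal{Y}} \left[\omega(x,y) - \omega_\theta(\sigma(x),\tau(y))\right]^2 dx\, dy,$$

denoting the error between the mapping $(x,y) \mapsto \omega_\theta(\sigma(x),\tau(y))$ and the generative
$\omega$. If there exists $\theta$ such that $R_\omega(\sigma,\tau;\theta)$ is low for some $\sigma : \mathcal{X} \mapsto \mathcal{S}$ and $\tau : \mathcal{Y} \mapsto \mathcal{T}$,
then $\omega_\theta$ (or more precisely, its transformation $(x,y) \mapsto \omega_\theta(\sigma(x),\tau(y))$) can be considered
a good approximation to $\omega$.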


2. Approximation of $\sigma^* = \arg\min_\sigma R_\omega(\sigma,\tau;\theta)$ by $S$: Overloading notation, let $R_\omega(S,\tau;\theta)$
denote

$$R_\omega(S,\tau;\theta) = \frac{1}{m} \sum_{i=1}^m \int_{\mathcal{Y}} \left[\omega(x_i,y) - \omega_\theta(S_i,\tau(y))\right]^2 dy.$$

To motivate this quantity, consider that given $(\tau, \theta)$, the optimal partition $\sigma^* : [0,1] \mapsto [K]$
is the greedy assignment for each $x \in [0,1]$:

$$\sigma^*(x) = \arg\min_{s \in [K]} \int_{[0,1]} \left[\omega(x,y) - \omega_\theta(s,\tau(y))\right]^2 dy.$$

If there exists $(S, \theta)$ such that $R_\omega(S,\tau;\theta)$ is low for some choice of $\tau$, then $S$ can be
considered a good approximation to the corresponding $\{\sigma^*(x_i)\}_{i=1}^m$.

Theorem 2 will imply that for models of the form (4), minimizing $R_A$ is asymptotically a
reasonable proxy for minimizing $R_\omega$ (by both metrics described above), with rates of convergence
depending on the covering numbers of $\mathcal{S}$ and $\mathcal{T}$.
4.3 Convergence of the Empirical Risk Function

Theorem 2 gives uniform bounds between $R_A$ and $R_\omega$ for models of form (4). Specifically, for
each choice of $(S, T) \in \mathcal{S}^m \times \mathcal{T}^n$, there exists transformations $\sigma_S : \mathcal{X} \mapsto \mathcal{S}$ and $\tau_T : \mathcal{Y} \mapsto \mathcal{T}$
such that $R_A(S,T;\theta) \approx R_\omega(\sigma_S,\tau_T;\theta) \approx R_\omega(S,\tau_T;\theta)$, up to an additive constant and with
uniform convergence rates depending on $d$ and $K$. As a result, minimization of $R_A(S,T;\theta)$ is
a reasonable proxy for minimizing $R_\omega$, by either measure defined in Section 4.2.

In addition, the mappings $\sigma_S$ and $\tau_T$ will resemble $S$ and $T$, in that they will induce similar
distributions over the latent variables. To quantify this, we define the following quantities.
Given $S \in [K]^m \times \mathcal{D}^m$, we will let $S = (U, B)$, where $U \in [K]^m$ and $B \in \mathcal{D}^m$, and similarly
let $T = (V, D)$ where $V \in [K]^n$ and $D \in \mathcal{D}^n$. Likewise, given $\sigma : \mathcal{X} \mapsto [K] \times \mathcal{D}$, we will
let $\sigma = (\mu, \beta)$, where $\mu : \mathcal{X} \mapsto [K]$ and $\beta : \mathcal{X} \mapsto \mathcal{D}$, and similarly let $\tau = (\nu, \delta)$ where
$\nu : \mathcal{Y} \mapsto [K]$ and $\delta : \mathcal{Y} \mapsto \mathcal{D}$. Let $\Psi_S$, $\Psi_T$, $\Psi_\sigma$, and $\Psi_\tau$ denote the CDFs of the values given by
$S$, $T$, $\sigma$ and $\tau$, which are functions $[K] \times [0,1)^d \mapsto [0,1]$ equaling:



where inequalities of the form $c \le c'$ for $c, c' \in [0,1)^d$ are satisfied if they hold entrywise.

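Theorem 2. Let $A \in \{0,1\}^{m \times n}$, with fixed ratio $m/n$, be generated by some $\omega$ according to
Definition 1. Let $(\mathcal{S}, \mathcal{T}, \Theta)$ denote a model of the form (4).

1. For each $T \in \mathcal{T}^n$, there exists $\tau_T : \mathcal{Y} \mapsto \mathcal{T}$ such that

$$\max_{S,T,\theta \in \mathcal{S}^m \times \mathcal{T}^n \times \Theta} \left| R_A(S,T;\theta) - R_\omega(S,\tau_T;\theta) - C_1 \right| + \frac{\|\Psi_T - \Psi_{\tau_T}\|^2}{Kd} \qquad (5)$$

where $C_1 \in \mathbb{R}$ is constant in $(S, T, \theta)$.

2. For each $S \in \mathcal{S}^m$, there exists $\sigma_S : \mathcal{X} \mapsto \mathcal{S}$ such that

$$\sup_{S,\tau,\theta \in \mathcal{S}^m \times \mathcal{T}^{\mathcal{Y}} \times \Theta} \left| R_\omega(S,\tau;\theta) - R_\omega(\sigma_S,\tau;\theta) - C_2 \right| + \frac{\|\Psi_S - \Psi_{\sigma_S}\|^2}{Kd} \qquad (6)$$

where $C_2 \in \mathbb{R}$ is constant in $(S, \tau, \theta)$.

3. Combining (5) and (6) yields

$$\max_{S,T,\theta \in \mathcal{S}^m \times \mathcal{T}^n \times \Theta}$$
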

Remarks for Theorem 2 Theorem 2 states that any assignment $S$ and $T$ of latent variables
to the rows and columns can be extended to the populations, such that the population exhibits
a similar distribution of values in $\mathcal{S}$ and $\mathcal{T}$, and the population risk as a function of $\theta$ is close to
the empirical risk.

The theorem may also be viewed as an oracle inequality, in that for any fixed $S$ and $T$,
minimizing $\theta \mapsto R_A(S,T;\theta)$ is approximately equivalent to minimizing $\theta \mapsto R_\omega(\sigma_S,\tau_T;\theta)$,
as if the model $\omega$ were known. This implies that the best parametric approximation to $\omega$ can be
learned, for any choice of $\sigma_S$ and $\tau_T$. However, it is not known whether the mappings $S \mapsto \sigma_S$
and $T \mapsto \tau_T$ are approximately onto; if not, minimization of $R_A$ over $(S, T, \theta)$ is a reasonable
proxy for minimization of $R_\omega$ over $(\sigma, \tau, \theta)$, but only over a subset of the possible mappings
$\sigma : \mathcal{X} \mapsto \mathcal{S}$ and $\tau : \mathcal{Y} \mapsto \mathcal{T}$.

The convergence of $\Psi_S$ to $\Psi_{\sigma_S}$ is established in Euclidean norm. This implies pointwise
convergence at every continuity point of $\Psi_{\sigma_S}$, thus implying weak convergence and also
convergence in Wasserstein distance.

The proof is contained in Appendix B. It is similar to that of Theorem 1, but requires
substantially more notation due to the additional parameters. Essentially, the proof approximates
the model of (4) by a blockmodel, and then applies Theorem 1 to bound the difference between
$R_A$ and $R_\omega$.

5 Proof of Theorem 1 

We present a sketch of the proof for Theorem 1, which defines the most important quantities. 
We then present helper lemmas and give the proof of the theorem. 

5.1 Proof Sketch 

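Let $W \in [0,1]^{m \times n}$ denote the expectation of $A$, conditioned on the latent variables $x_1,\ldots,x_m$
and $y_1,\ldots,y_n$:

$$W_{ij} = \omega(x_i,y_j), \qquad i \in [m],\ j \in [n],$$

and let $\Phi_W(S, T)$ denote the conditional expectation of $\Phi_A(S, T)$:

$$[\Phi_W(S,T)]_{st} = \frac{1}{nm} \sum_{i=1}^m \sum_{j=1}^n W_{ij}\, 1(S_i = s, T_j = t).$$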
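Given co-cluster labels $S \in [K]^m$ and $T \in [K]^n$, let $1_{S=s} \in \{0,1\}^m$ and $1_{T=t} \in \{0,1\}^n$ denote
the indicator variables

$$1_{S=s}(i) = \begin{cases} 1 & \text{if } S_i = s \\ 0 & \text{otherwise} \end{cases} \quad \text{and} \quad 1_{T=t}(j) = \begin{cases} 1 & \text{if } T_j = t \\ 0 & \text{otherwise.} \end{cases}$$

Let $g_{T=t} \in [0,1]^m$ denote the vector $n^{-1} W 1_{T=t}$, or

$$g_{T=t}(i) = \frac{1}{n} \sum_{j=1}^n W_{ij}\, 1(T_j = t), \qquad i \in [m].$$

It can be seen that the entries of $\Phi_W(S, T)$ can be written as

$$[\Phi_W(S,T)]_{st} = \frac{1}{m} \langle 1_{S=s}, g_{T=t} \rangle, \qquad (7)$$

where $\langle \cdot, \cdot \rangle$ denotes inner product. Similarly, the entries of $\Phi_\omega(S, \tau)$ can be written as

$$[\Phi_\omega(S,\tau)]_{st} = \frac{1}{m} \langle 1_{S=s}, g_{\tau=t} \rangle, \qquad (8)$$

where $g_{\tau=t} \in [0,1]^m$ is the vector

$$g_{\tau=t}(i) = \int_{\mathcal{Y}} \omega(x_i,y)\, 1\{\tau(y) = t\}\, dy, \qquad i \in [m].$$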

The proof of Theorem 1 will require three main steps: 

S1: In Lemma 1, a concentration inequality will be used to show that $\Phi_A(S,T) \approx \Phi_W(S,T)$
uniformly over all possible values of $(S, T)$.

S2: For each $T \in [K]^n$, we will show there exists $\tau : \mathcal{Y} \mapsto [K]$ such that $g_{T=t} \approx g_{\tau=t}$ for
$t \in [K]$. By (7) and (8), this will imply that $\Phi_W(S,T) \approx \Phi_\omega(S,\tau)$ uniformly for all
$S \in [K]^m$. The mapping $\tau$ will also satisfy $\pi_T \approx \pi_\tau$, so that $T$ and $\tau$ have similar
class frequencies.

S3: Analogous to S2, we will show that for each $S \in [K]^m$, there exists $\sigma_S : \mathcal{X} \mapsto [K]$ such
that $\Phi_\omega(S,\tau) \approx \Phi_\omega(\sigma_S,\tau)$ uniformly over $\tau$, and also that $\pi_S \approx \pi_{\sigma_S}$.

Steps S1 and S2 correspond to (1) in Theorem 1, while step S3 corresponds to (2).

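Let $G_T$ and $G_\tau$ denote the stacked vectors in $\mathbb{R}^{mK+K}$ given by

$$G_T = \left(\frac{g_{T=1}}{\sqrt{m}}, \ldots, \frac{g_{T=K}}{\sqrt{m}}, \pi_T\right) \quad \text{and} \quad G_\tau = \left(\frac{g_{\tau=1}}{\sqrt{m}}, \ldots, \frac{g_{\tau=K}}{\sqrt{m}}, \pi_\tau\right),$$

and let $\mathcal{G}_n$ and $\mathcal{G}$ denote the set of all possible values for $G_T$ and $G_\tau$:

$$\mathcal{G}_n = \{G_T : T \in [K]^n\} \quad \text{and} \quad \mathcal{G} = \{G_\tau : \tau \in \mathcal{Y} \mapsto [K]\}.$$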
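Step S2 is established by showing that the sets $\mathcal{G}_n$ and $\mathcal{G}$ converge in Hausdorff distance. This
will require the following facts. The Hausdorff distance (in Euclidean norm) between two sets
$\mathcal{B}_1$ and $\mathcal{B}_2$ is defined as

$$d_{\mathrm{Haus}}(\mathcal{B}_1, \mathcal{B}_2) = \max\left\{ \sup_{B_1 \in \mathcal{B}_1} \inf_{B_2 \in \mathcal{B}_2} \|B_1 - B_2\|,\ \sup_{B_2 \in \mathcal{B}_2} \inf_{B_1 \in \mathcal{B}_1} \|B_1 - B_2\| \right\}.$$

Given a Hilbert space $\mathbb{H}$ and a set $\mathcal{B} \subset \mathbb{H}$, let $\Gamma_{\mathcal{B}} : \mathbb{H} \mapsto \mathbb{R}$ denote the support function of $\mathcal{B}$,
defined as

$$\Gamma_{\mathcal{B}}(H) = \sup_{B \in \mathcal{B}} \langle H, B \rangle.$$

It is known that the convex hull $\mathrm{conv}(\mathcal{B})$ equals the intersection of its supporting hyperplanes:

$$\mathrm{conv}(\mathcal{B}) = \{x \in \mathbb{H} : \langle x, H \rangle \le \Gamma_{\mathcal{B}}(H) \text{ for all } H \in \mathbb{H}\},$$

and that the Hausdorff distance between $\mathrm{conv}(\mathcal{B}_1)$ and $\mathrm{conv}(\mathcal{B}_2)$ is given by [27, Thm 1.8.11],
[3, Cor 7.59]

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{B}_1), \mathrm{conv}(\mathcal{B}_2)) = \sup_{H : \|H\| = 1} \left| \Gamma_{\mathcal{B}_1}(H) - \Gamma_{\mathcal{B}_2}(H) \right|. \qquad (9)$$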
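
For finite point sets in Euclidean space, both the Hausdorff distance and the support function can be evaluated by direct enumeration. The following sketch is illustrative only (the function names and the example sets are assumptions, not taken from the paper). It shows two sets that are a positive Hausdorff distance apart but have the same convex hull, so that their support functions agree in every direction tested.

```python
import math

def hausdorff(B1, B2):
    """Hausdorff distance (Euclidean norm) between two finite point sets."""
    def dist(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
    def directed(P, Q):
        # sup over p in P of the distance from p to its nearest point in Q
        return max(min(dist(p, q) for q in Q) for p in P)
    return max(directed(B1, B2), directed(B2, B1))

def support(B, H):
    """Support function of a finite set B evaluated at direction H."""
    return max(sum(h * b for h, b in zip(H, point)) for point in B)

B1 = [(0.0, 0.0), (1.0, 0.0)]
B2 = [(0.0, 0.0), (1.0, 0.0), (0.5, 0.0)]  # adds a point inside conv(B1)
directions = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.6, 0.8)]

d = hausdorff(B1, B2)
same_support = all(support(B1, H) == support(B2, H) for H in directions)
```

This is why the comparison in (9) is stated for convex hulls rather than for the sets themselves, and why a separate argument is needed to pass from the convex hull back to the original set.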

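To establish S2, Lemma 2 will show that

$$\sup_{H : \|H\| = 1} \left| \Gamma_{\mathcal{G}_n}(H) - \Gamma_{\mathcal{G}}(H) \right| = O_P\left(K (\log n)\, n^{-1/2}\right), \qquad (10)$$

and Lemma 3 will show that

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{G}), \mathcal{G}) = 0. \qquad (11)$$

By (9) and (10), $\mathrm{conv}(\mathcal{G}_n)$ and $\mathrm{conv}(\mathcal{G})$ converge in Hausdorff distance, which by (11) implies
that $\mathrm{conv}(\mathcal{G}_n)$ and $\mathcal{G}$ converge in Hausdorff distance. This implies that for each $G_T \in \mathcal{G}_n$, there
exists $G_\tau \in \mathcal{G}$ such that $\max_T \|G_T - G_\tau\| \to 0$. This will establish S2, since $G_T \approx G_\tau$ implies
by (7) and (8) that $\Phi_W(S,T) \approx \Phi_\omega(S,\tau)$ uniformly over $S \in [K]^m$, and it also implies that
$\pi_T \approx \pi_\tau$ as well.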

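The proof of S3 will be similar to S2. It can be seen that $\Phi_\omega(S, \tau)$ and $\Phi_\omega(\sigma, \tau)$ can be
written as

$$[\Phi_\omega(S,\tau)]_{st} = \langle f_{S=s}, 1_{\tau=t} \rangle \quad \text{and} \quad [\Phi_\omega(\sigma,\tau)]_{st} = \langle f_{\sigma=s}, 1_{\tau=t} \rangle, \qquad (12)$$

where the functions $f_{S=s}$, $1_{\tau=t}$, and $f_{\sigma=s}$ are given by

$$1_{\tau=t}(y) = \begin{cases} 1 & \text{if } \tau(y) = t \\ 0 & \text{otherwise,} \end{cases}$$

$$f_{S=s}(y) = \frac{1}{m} \sum_{i=1}^m \omega(x_i,y)\, 1\{S_i = s\},$$

$$f_{\sigma=s}(y) = \int_{\mathcal{X}} \omega(x,y)\, 1\{\sigma(x) = s\}\, dx.$$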
Analogous to S2, we will define $F_S$ and $F_\sigma$, given by

$$F_S = (f_{S=1}, \ldots, f_{S=K}, \pi_S) \quad \text{and} \quad F_\sigma = (f_{\sigma=1}, \ldots, f_{\sigma=K}, \pi_\sigma),$$

whose possible values are given by

$$\mathcal{F}_m = \{F_S : S \in [K]^m\} \quad \text{and} \quad \mathcal{F} = \{F_\sigma : \sigma \in \mathcal{X} \mapsto [K]\}.$$

Lemma 2 will show that the support functions $\Gamma_{\mathcal{F}_m}$ and $\Gamma_{\mathcal{F}}$ converge, and Lemma 3 will show
that $d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{F}), \mathcal{F}) = 0$. Using (12), this will establish S3 by arguments that are analogous
to those used to prove S2.

5.2 Intermediate Results for Proof of Theorem 1 

Lemmas 1-3 will be used to prove Theorem 1, and are proven in Section 5.4. 

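Lemma 1 states that $\Phi_A \approx \Phi_W$ for all $(S, T)$.

Lemma 1. Under the conditions of Theorem 1,

$$\max_{S,T} \|\Phi_A(S,T) - \Phi_W(S,T)\|^2 = O_P\left((\log K)\, n^{-1}\right). \qquad (13)$$

Lemma 2 states that the support functions of $\mathcal{G}$ and $\mathcal{G}_n$ and of $\mathcal{F}$ and $\mathcal{F}_m$ converge.

Lemma 2. Under the conditions of Theorem 1,

$$\sup_{\|H\| = 1} \left| \Gamma_{\mathcal{G}_n}(H) - \Gamma_{\mathcal{G}}(H) \right| \le O_P\left(K (\log n)\, n^{-1/2}\right) \qquad (14)$$

$$\sup_{\|H\| = 1} \left| \Gamma_{\mathcal{F}_m}(H) - \Gamma_{\mathcal{F}}(H) \right| \le O_P\left(K (\log m)\, m^{-1/2}\right), \qquad (15)$$

which implies

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{G}_n), \mathrm{conv}(\mathcal{G})) \le O_P\left(K (\log n)\, n^{-1/2}\right)$$

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{F}_m), \mathrm{conv}(\mathcal{F})) \le O_P\left(K (\log m)\, m^{-1/2}\right).$$

Lemma 3 states that the sets $\mathcal{F}$ and $\mathcal{G}$ are essentially convex.

Lemma 3. It holds that

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{G}), \mathcal{G}) = 0 \qquad (16)$$

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{F}), \mathcal{F}) = 0. \qquad (17)$$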
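5.3 Proof of Theorem 1

Proof of Theorem 1. We bound $\|\Phi_W(S,T) - \Phi_\omega(S,\tau)\|^2$ uniformly over $S$, as follows:

$$\|\Phi_W(S,T) - \Phi_\omega(S,\tau)\|^2 = \sum_{s=1}^K \sum_{t=1}^K \left([\Phi_W(S,T)]_{st} - [\Phi_\omega(S,\tau)]_{st}\right)^2$$

$$= \sum_{s=1}^K \sum_{t=1}^K \frac{1}{m^2} \langle 1_{S=s},\ g_{T=t} - g_{\tau=t} \rangle^2$$

$$\le \sum_{s=1}^K \sum_{t=1}^K \frac{1}{m^2} \|1_{S=s}\|^2\, \|g_{T=t} - g_{\tau=t}\|^2$$

$$= \sum_{t=1}^K \frac{1}{m} \|g_{T=t} - g_{\tau=t}\|^2, \qquad (18)$$

where the inequality is Cauchy-Schwarz and (18) holds because $m^{-1} \sum_{s=1}^K \|1_{S=s}\|^2 = 1$.

By Lemma 2 and Lemma 3, it holds that $d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{G}_n), \mathcal{G}) = O_P(K (\log n)\, n^{-1/2})$. Given
$T$, let $\tau = \tau_T$ denote the minimizer of $\|G_T - G_\tau\|^2 = \langle G_T - G_\tau, G_T - G_\tau \rangle$. It follows that

$$\max_T \|G_T - G_{\tau_T}\|^2 = \max_T \sum_{t=1}^K \frac{1}{m} \|g_{T=t} - g_{\tau_T=t}\|^2 + \|\pi_T - \pi_{\tau_T}\|^2 = O_P\left(\frac{K^2 \log n}{n}\right). \qquad (19)$$

Combining (13), (19), and (18) yields

$$\max_{S,T} \|\Phi_A(S,T) - \Phi_\omega(S,\tau_T)\|^2 + \|\pi_T - \pi_{\tau_T}\|^2 = O_P\left(\frac{K^2 \log n}{n}\right),$$

establishing (1).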

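The proof of (2) proceeds in similar fashion. The quantity $\|\Phi_\omega(S,\tau) - \Phi_\omega(\sigma,\tau)\|^2$ may be
bounded uniformly over $\tau$:

$$\|\Phi_\omega(S,\tau) - \Phi_\omega(\sigma,\tau)\|^2 = \sum_{s=1}^K \sum_{t=1}^K \left([\Phi_\omega(S,\tau)]_{st} - [\Phi_\omega(\sigma,\tau)]_{st}\right)^2$$

$$= \sum_{s=1}^K \sum_{t=1}^K \langle f_{S=s} - f_{\sigma=s},\ 1_{\tau=t} \rangle^2$$

$$\le \sum_{s=1}^K \|f_{S=s} - f_{\sigma=s}\|^2, \qquad (20)$$

where all steps parallel the derivation of (18). It follows from Lemma 2 and 3 that
$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{F}_m), \mathcal{F}) = O_P(K (\log m)\, m^{-1/2})$. Given $S$, let $\sigma = \sigma_S$ denote the minimizer of $\|F_S - F_\sigma\|$, so that

$$\max_S \|F_S - F_{\sigma_S}\|^2 = \max_S \sum_{s=1}^K \|f_{S=s} - f_{\sigma_S=s}\|^2 + \|\pi_S - \pi_{\sigma_S}\|^2 = O_P\left(\frac{K^2 \log m}{m}\right). \qquad (21)$$

Combining (21) and (20) yields

$$\sup_{S,\tau} \|\Phi_\omega(S,\tau) - \Phi_\omega(\sigma_S,\tau)\|^2 + \|\pi_S - \pi_{\sigma_S}\|^2 = O_P\left(\frac{K^2 \log m}{m}\right),$$

establishing (2) and completing the proof.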


□ 


5.4 Proof of Lemmas 1-3 

The proof of Lemma 2 will rely on Lemma 4, which is a very slight modification of Lemma 
4.3 in [4]. Lemma 4 is proven in the Appendix. 

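Lemma 4. Let $\mathbb{H}$ denote a Hilbert space, with inner product $\langle \cdot, \cdot \rangle$ and induced norm $\|\cdot\|$. Let
$g : \mathcal{Y} \mapsto \mathbb{H}$, and let $y_1,\ldots,y_n \in \mathcal{Y}$ be i.i.d. Let $L_n : \mathbb{H}^K \mapsto \mathbb{R}$ be defined as

$$L_n(H) = \frac{1}{n} \sum_{j=1}^n \max_{k \in [K]} \langle h_k, g(y_j) \rangle. \qquad (22)$$

Let $\mathcal{H} = \{H \in \mathbb{H}^K : \|h_k\| \le 1,\ k \in [K]\}$. It holds that



To prove Lemma 3, we will require a theorem for finite dimensional convex hulls:

Theorem 3. [27, Thm 1.1.4] If $\mathcal{B} \subset \mathbb{R}^d$ and $x \in \mathrm{conv}(\mathcal{B})$, there exists $B_1,\ldots,B_{d+1} \in \mathcal{B}$ such that
$x \in \mathrm{conv}\{B_1,\ldots,B_{d+1}\}$.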

Additionally, we will also require some results on Hilbert-Schmidt integral operators. A
kernel function $\omega : \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}$ is Hilbert-Schmidt if it satisfies

$$\int_{\mathcal{X} \times \mathcal{Y}} |\omega(x,y)|^2\, dx\, dy < \infty.$$

It can be seen that $\omega$ defined by Definition 1 is Hilbert-Schmidt. Let $\Omega$ denote the integral
operator induced by $\omega$.
It is known that a Hilbert-Schmidt operator $\Omega$ is a limit (in operator norm) of a sequence of
finite rank operators, so that its kernel $\omega$ has singular value decomposition given by

$$\omega(x,y) = \sum_{q=1}^{\infty} \lambda_q u_q(x) v_q(y),$$

where $\{u_q\}_{q=1}^{\infty}$ and $\{v_q\}_{q=1}^{\infty}$ are sets of orthonormal functions mapping $\mathcal{X} \mapsto \mathbb{R}$ and $\mathcal{Y} \mapsto \mathbb{R}$,
and $\lambda_1, \lambda_2, \ldots$ are scalars decreasing in magnitude and satisfying $\sum_{q=1}^{\infty} \lambda_q^2 < \infty$.

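Proof of Lemma 1. Given $(S, T)$, let $\Delta \in [-1,1]^{K \times K}$ denote the quantity

$$\Delta_{st} = \frac{1}{nm} \sum_{i=1}^m \sum_{j=1}^n (A_{ij} - W_{ij})\, 1(S_i = s, T_j = t).$$

It holds that $E[\Delta \mid W] = 0$, and by Hoeffding's inequality,

$$P\left(|\Delta_{st}| \ge \epsilon \mid W\right) \le 2e^{-2nm\epsilon^2}, \qquad s,t \in [K].$$

Conditioned on $W$, each entry of $\Delta$ is independent of the others. Given $\delta \in [-1,1]^{K \times K}$, it
follows that

$$P(\Delta = \delta \mid W) = \prod_{s=1}^K \prod_{t=1}^K P(\Delta_{st} = \delta_{st} \mid W) \le 2 \exp\left(-2nm \sum_{s=1}^K \sum_{t=1}^K \delta_{st}^2\right).$$

Let $\mathcal{B}$ denote the set

$$\mathcal{B} = \left\{ \delta \in [-1,1]^{K \times K} : \sum_{s,t} \delta_{st}^2 \ge \epsilon,\ \delta \in \mathrm{supp}(\Delta) \right\}.$$

The cardinality of $\mathcal{B}$ is smaller than the support of $\Delta$, which is less than $(nm)^{K^2}$ when
conditioned on $W$. It follows by a union bound over $\mathcal{B}$ that

$$P(\Delta \in \mathcal{B} \mid W) \le 2|\mathcal{B}|\, e^{-2nm\epsilon} \le 2 (nm)^{K^2} e^{-2nm\epsilon}.$$

It can be seen that $\|\Phi_A(S,T) - \Phi_W(S,T)\|^2 = \sum_{s,t} \Delta_{st}^2$, implying that $\Delta \in \mathcal{B}$ is equivalent to
the event that $\|\Phi_A(S,T) - \Phi_W(S,T)\|^2 \ge \epsilon$. A union bound over all $S, T$ implies that

$$P\left(\max_{S,T} \|\Phi_A(S,T) - \Phi_W(S,T)\|^2 \ge \epsilon\right) \le 2 K^{n+m} (nm)^{K^2} e^{-2nm\epsilon}.$$

Letting $\epsilon = C(1 + n/m)(\log K)\, n^{-1}$ for some $C$ proves the lemma. □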
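Proof of Lemma 2. Let $g_y \in [0,1]^m$ denote the column of $W$ induced by $y \in \mathcal{Y}$, and let
$f_x \in [0,1]^{\mathcal{Y}}$ denote the row of $\omega$ corresponding to $x \in \mathcal{X}$:

$$g_y(i) = \omega(x_i,y),\ i \in [m] \quad \text{and} \quad f_x(y) = \omega(x,y),\ y \in \mathcal{Y}.$$

Algebraic manipulation shows that

$$g_{T=t} = \frac{1}{n} \sum_{j=1}^n g_{y_j}\, 1(T_j = t), \qquad g_{\tau=t} = \int_{\mathcal{Y}} g_y\, 1(\tau(y) = t)\, dy,$$

$$f_{S=s} = \frac{1}{m} \sum_{i=1}^m f_{x_i}\, 1(S_i = s), \qquad f_{\sigma=s} = \int_{\mathcal{X}} f_x\, 1(\sigma(x) = s)\, dx.$$

Given $H = (h_1,\ldots,h_K,\pi_H)$, it follows that the inner products $\langle H, G_T \rangle$, $\langle H, G_\tau \rangle$, $\langle H, F_S \rangle$,
and $\langle H, F_\sigma \rangle$ equal

$$\langle H, G_T \rangle = \frac{1}{n} \sum_{j=1}^n \left[ \left\langle h_{T_j}, \frac{g_{y_j}}{\sqrt{m}} \right\rangle + \pi_H(T_j) \right], \qquad \langle H, G_\tau \rangle = \int_{\mathcal{Y}} \left\langle h_{\tau(y)}, \frac{g_y}{\sqrt{m}} \right\rangle + \pi_H(\tau(y))\, dy,$$

$$\langle H, F_S \rangle = \frac{1}{m} \sum_{i=1}^m \left[ \langle h_{S_i}, f_{x_i} \rangle + \pi_H(S_i) \right], \qquad \langle H, F_\sigma \rangle = \int_{\mathcal{X}} \langle h_{\sigma(x)}, f_x \rangle + \pi_H(\sigma(x))\, dx,$$

and hence that the support functions equal

$$\Gamma_{\mathcal{G}_n}(H) = \frac{1}{n} \sum_{j=1}^n \max_{k \in [K]} \left\langle h_k, \frac{g_{y_j}}{\sqrt{m}} \right\rangle + \pi_H(k), \qquad \Gamma_{\mathcal{G}}(H) = \int_{\mathcal{Y}} \max_{k \in [K]} \left\langle h_k, \frac{g_y}{\sqrt{m}} \right\rangle + \pi_H(k)\, dy,$$

$$\Gamma_{\mathcal{F}_m}(H) = \frac{1}{m} \sum_{i=1}^m \max_{k \in [K]} \langle h_k, f_{x_i} \rangle + \pi_H(k), \qquad \Gamma_{\mathcal{F}}(H) = \int_{\mathcal{X}} \max_{k \in [K]} \langle h_k, f_x \rangle + \pi_H(k)\, dx,$$

which implies that $E\,\Gamma_{\mathcal{G}_n}(H) = \Gamma_{\mathcal{G}}(H)$ and $E\,\Gamma_{\mathcal{F}_m}(H) = \Gamma_{\mathcal{F}}(H)$.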
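To show (14), we observe that $\Gamma_{\mathcal{G}_n}$ can be rewritten as

$$\Gamma_{\mathcal{G}_n}(H) = \frac{1}{n} \sum_{j=1}^n \max_{k \in [K]} \left\langle \begin{bmatrix} h_k \\ \pi_H(k) \end{bmatrix}, \begin{bmatrix} m^{-1/2} g_{y_j} \\ 1 \end{bmatrix} \right\rangle,$$

which matches (22) so that Lemma 4 can be applied. Applying Lemma 4 results in

$$E \sup_{\|H\| = 1} \left| \Gamma_{\mathcal{G}_n}(H) - \Gamma_{\mathcal{G}}(H) \right| \le \frac{4K}{\sqrt{n}}, \qquad (23)$$

where we have used $\{H : \|H\| = 1\} \subset \mathcal{H}$ and $\left\| (m^{-1/2} g_y, 1) \right\| \le 2$.

Let $Z(y_1,\ldots,y_n) = \sup_{\|H\| = 1} |\Gamma_{\mathcal{G}_n}(H) - \Gamma_{\mathcal{G}}(H)|$. For $\ell \in [n]$, changing $y_\ell$ to $y_\ell'$ changes
$Z$ by at most $4/n$. Applying McDiarmid's inequality yields

$$P(|Z - EZ| \ge \epsilon) \le 2e^{-\epsilon^2 n/8}.$$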
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Letting e = n _1//2 logn implies that Z — E Z = Op(n~ 1//2 logn), which combined with (23) 
implies (14). 
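As a numerical illustration of the concentration step, the following minimal sketch evaluates the generic two-sided McDiarmid bound $2\exp(-2\epsilon^2 / \sum_\ell c_\ell^2)$ at the rate $\epsilon = n^{-1/2}\log n$. The function names are hypothetical, and the bounded-difference constant $c = 2\sqrt{2}/n$ is an assumption chosen to agree with the exponent $-2\epsilon^2 n/8$; only the decay in $n$ matters for the $O_P$ statement.

```python
import math


def mcdiarmid_tail(eps, n, c):
    """Two-sided McDiarmid bound 2*exp(-2*eps^2 / (n*c^2)).

    Assumes n independent coordinates, each with the same bounded
    difference c (an assumption made for this illustration).
    """
    return 2.0 * math.exp(-2.0 * eps ** 2 / (n * c ** 2))


def tail_at_rate(n):
    """Evaluate the bound at eps = log(n)/sqrt(n) with c = 2*sqrt(2)/n."""
    c = 2.0 * math.sqrt(2.0) / n          # assumed bounded difference
    eps = math.log(n) / math.sqrt(n)      # deviation level used in the proof
    return mcdiarmid_tail(eps, n, c)


if __name__ == "__main__":
    for n in (10, 1000, 1000000):
        print(n, tail_at_rate(n))
```

Under these assumptions the bound simplifies to $2\exp(-(\log n)^2/4)$, which tends to zero as $n$ grows.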

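To show (15), we observe that

$$\Gamma_{\mathcal{F}_m}(H) = \frac{1}{m}\sum_{i=1}^m\max_{k\in[K]}\left\langle\begin{bmatrix}h_k\\ \pi_H(k)\end{bmatrix}, \begin{bmatrix}f_{x_i}\\ 1\end{bmatrix}\right\rangle,$$

so that Lemma 4 and McDiarmid's inequality can be used analogously to the proof of (14). □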


We divide the proof of Lemma 3 into two sub-lemmas, one showing (16) and the other showing (17). This is because the proof of (17) will require additional work, due to the fact that the elements of $\mathcal{F}$ are infinite dimensional.

Lemma 5. For each $G^* \in \mathrm{conv}(\mathcal{G})$, there exist $G_1, G_2, \ldots \in \mathcal{G}$ such that $\lim_{i\to\infty}\|G^* - G_i\| = 0$.

Lemma 6. For each $F^* \in \mathrm{conv}(\mathcal{F})$, there exist $F_1, F_2, \ldots \in \mathcal{F}$ such that $\lim_{i\to\infty}\|F^* - F_i\| = 0$.

Proof of Lemma 5. Recall the definition of $g_y \in [0,1]^m$ from the proof of Lemma 2:

$$g_y(i) = \omega(x_i, y), \quad i \in [m],$$

and that $g_{\tau=t}$ can be written as

$$g_{\tau=t} = \int_{\mathcal{Y}} g_y 1\{\tau(y) = t\}\,dy.$$

We note the following properties of $\{g_y : y \in \mathcal{Y}\}$:

P1: Each $G^* \in \mathrm{conv}(\mathcal{G})$ is a finite convex combination of elements in $\mathcal{G}$. This holds by Theorem 3, since $\mathcal{G}$ is a subset of $[0,1]^{mK+K}$, a finite dimensional space.

P2: For all $\epsilon$, there exists a finite set $B$ that is an $\epsilon$-cover of $\{g_y : y \in \mathcal{Y}\}$ in Euclidean norm. This holds because $\{g_y : y \in \mathcal{Y}\}$ is a subset of the unit cube $[0,1]^m$.

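By P1, each $G^* \in \mathrm{conv}(\mathcal{G})$ can be written as a finite convex combination of elements in $\mathcal{G}$, so that for some integer $N > 0$ there exist $G_{\tau_1},\ldots,G_{\tau_N} \in \mathcal{G}$ such that

$$G^* = \sum_{i=1}^N \eta_i G_{\tau_i},$$

where $\eta$ is in the $N$-dimensional unit simplex. It follows that for some $\mu : \mathcal{Y} \to [0,1]^K$ satisfying $\sum_k \mu_k(y) = 1$ for all $y$, $G^* = (g^*_1/\sqrt{m},\ldots,g^*_K/\sqrt{m},\pi^*_G)$ satisfies

$$g^*_k = \int_{\mathcal{Y}} g_y\,\mu_k(y)\,dy \qquad\text{and}\qquad \pi^*_G(k) = \int_{\mathcal{Y}}\mu_k(y)\,dy, \quad k \in [K].$$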
We now construct $\tau : \mathcal{Y} \to [K]$ inducing $G_\tau \in \mathcal{G}$ which approximates $G^* \in \mathrm{conv}(\mathcal{G})$. By P2, let $B$ denote an $\epsilon$-cover of $\{g_y : y \in \mathcal{Y}\}$, and enumerate its elements as $b_1,\ldots,b_{|B|}$. Let $\ell : \mathcal{Y} \to [|B|]$ assign each $y \in \mathcal{Y}$ to its closest member in $B$, so that $\|g_y - b_{\ell(y)}\| \le \epsilon$. For $i = 1,\ldots,|B|$, let $\mathcal{Y}_i$ denote the set $\{y : \ell(y) = i\}$. Arbitrarily divide each region $\mathcal{Y}_i$ into $K$ disjoint sub-regions $\mathcal{Y}_{i1},\ldots,\mathcal{Y}_{iK}$ such that $\cup_k \mathcal{Y}_{ik} = \mathcal{Y}_i$, where the measure of each sub-region is given by

$$\int_{\mathcal{Y}_{ik}} 1\,dy = \int_{\mathcal{Y}_i}\mu_k(y)\,dy, \quad k \in [K]. \qquad (24)$$

Let $\tau : \mathcal{Y} \to [K]$ assign each region $\mathcal{Y}_{ik}$ to $k$, so that

$$\tau(y) = k \text{ for all } y \in \mathcal{Y}_{ik},\ i = 1,\ldots,|B|.$$


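By (24), it holds that $\pi_\tau = \pi^*_G$, and also that

$$g_{\tau=k} - g^*_k = \int_{\mathcal{Y}} g_y\left[1\{\tau(y) = k\} - \mu_k(y)\right]dy$$

$$= \int_{\mathcal{Y}}\left[b_{\ell(y)} + g_y - b_{\ell(y)}\right]\left[1\{\tau(y) = k\} - \mu_k(y)\right]dy$$

$$= \sum_{i=1}^{|B|} b_i\underbrace{\left[\int_{\mathcal{Y}_{ik}} 1\,dy - \int_{\mathcal{Y}_i}\mu_k(y)\,dy\right]}_{=0\text{ by (24)}} + \int_{\mathcal{Y}}\left(g_y - b_{\ell(y)}\right)\left[1\{\tau(y) = k\} - \mu_k(y)\right]dy$$

$$= 0 + \int_{\mathcal{Y}}\left(g_y - b_{\ell(y)}\right)\left[1\{\tau(y) = k\} - \mu_k(y)\right]dy,$$

which implies that

$$\|g_{\tau=k} - g^*_k\| \le \left\|\int_{\mathcal{Y}}\left(g_y - b_{\ell(y)}\right)1\{\tau(y) = k\}\,dy\right\| + \left\|\int_{\mathcal{Y}}\left(g_y - b_{\ell(y)}\right)\mu_k(y)\,dy\right\|$$

$$\le 2\int_{\mathcal{Y}}\|g_y - b_{\ell(y)}\|\,dy \le 2\epsilon.$$

It follows that $\|G_\tau - G^*\|^2 = \sum_{k=1}^K m^{-1}\|g_{\tau=k} - g^*_k\|^2 + \|\pi_\tau - \pi^*_G\|^2 \le 4K\epsilon^2 m^{-1}$, and hence that $\lim_{\epsilon\to 0}\|G_\tau - G^*\| = 0$, proving the lemma. □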

Proof of Lemma 6. Recall the definition of $f_x : \mathcal{Y} \to [0,1]$ from the proof of Lemma 2:

$$f_x(y) = \omega(x,y),$$

and that $f_{\sigma=s}$ can be written as

$$f_{\sigma=s} = \int_{\mathcal{X}} f_x 1\{\sigma(x) = s\}\,dx.$$

Because $\{f_x : x \in \mathcal{X}\}$ is not finite dimensional, the arguments of Lemma 5 do not directly apply. To circumvent this, we will approximate the space $\mathcal{F}$ by a finite dimensional $\tilde{\mathcal{F}}$, such that the convex hulls $\mathrm{conv}(\mathcal{F})$ and $\mathrm{conv}(\tilde{\mathcal{F}})$ converge.

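For $Q = 1, 2, \ldots$, let $\omega_Q$ be the best rank-$Q$ approximation to $\omega$,

$$\omega_Q(x,y) = \sum_{q=1}^Q \lambda_q u_q(x)v_q(y).$$

Given $D > 0$, let $\tilde{u}_q$ denote a truncation of $u_q$, defined as

$$\tilde{u}_q(x) = \begin{cases} D & \text{if } u_q(x) > D \\ u_q(x) & \text{if } -D \le u_q(x) \le D \\ -D & \text{if } u_q(x) < -D,\end{cases}$$

and let $\tilde{\omega} : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ be defined as

$$\tilde{\omega}(x,y) = \sum_{q=1}^Q \lambda_q\tilde{u}_q(x)v_q(y).$$

Let $\tilde{f}_x : \mathcal{Y} \to \mathbb{R}$ and $\tilde{f}_{\sigma=s}$ be defined as

$$\tilde{f}_x(y) = \tilde{\omega}(x,y) \qquad\text{and}\qquad \tilde{f}_{\sigma=s} = \int_{\mathcal{X}}\tilde{f}_x 1\{\sigma(x) = s\}\,dx.$$

Let $\tilde{F}_\sigma$ and $\tilde{\mathcal{F}}$ be defined as

$$\tilde{F}_\sigma = (\tilde{f}_{\sigma=1},\ldots,\tilde{f}_{\sigma=K},\pi_\sigma) \qquad\text{and}\qquad \tilde{\mathcal{F}} = \{\tilde{F}_\sigma : \sigma \in [K]^{\mathcal{X}}\}.$$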
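The truncation step can be illustrated numerically. The sketch below is only an illustration under stated assumptions: `truncate` and `truncation_error` are hypothetical helper names, and the list of sample values stands in for evaluations of a single function $u_q$ at a finite set of points. It shows that the average squared gap between a value and its truncation cannot increase as the level $D$ grows, and vanishes once $D$ exceeds every sample.

```python
def truncate(value, D):
    """Clip a scalar to the interval [-D, D]."""
    return max(-D, min(D, value))


def truncation_error(samples, D):
    """Average squared gap between each sample and its truncation at level D."""
    return sum((u - truncate(u, D)) ** 2 for u in samples) / len(samples)


if __name__ == "__main__":
    # Hypothetical evaluations of one function u_q at five points.
    samples = [-3.0, -0.5, 0.2, 1.5, 4.0]
    for D in (1.0, 2.0, 5.0):
        print(D, truncation_error(samples, D))
```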

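We bound the difference $\|f_x - \tilde{f}_x\|^2$:

$$\|f_x - \tilde{f}_x\|^2 = \sum_{q=1}^Q\lambda_q^2\left(u_q(x) - \tilde{u}_q(x)\right)^2 + \sum_{q=Q+1}^\infty\lambda_q^2 u_q(x)^2,$$

where we used the fact that $f_x = \sum_{q=1}^\infty\lambda_q u_q(x)v_q$, and that the functions $\{v_q\}$ are orthonormal. It follows that

$$\int_{\mathcal{X}}\|f_x - \tilde{f}_x\|^2\,dx = \sum_{q=1}^Q\lambda_q^2\int_{\mathcal{X}}\left(u_q(x) - \tilde{u}_q(x)\right)^2dx + \sum_{q=Q+1}^\infty\lambda_q^2\int_{\mathcal{X}}u_q(x)^2\,dx$$

$$= \sum_{q=1}^Q\lambda_q^2\int_{\mathcal{X}}\left(u_q(x) - \tilde{u}_q(x)\right)^2dx + \sum_{q=Q+1}^\infty\lambda_q^2$$

$$\le \sum_{q=1}^Q\lambda_q^2\int_{\{x:\,|u_q(x)| > D\}}u_q(x)^2\,dx + \sum_{q=Q+1}^\infty\lambda_q^2,$$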
from which it can be seen that

$$\lim_{\min(Q,D)\to\infty}\int_{\mathcal{X}}\|f_x - \tilde{f}_x\|^2\,dx = 0.$$

We use this result to bound $\|f_{\sigma=s} - \tilde{f}_{\sigma=s}\|$:

$$\max_{s,\sigma}\|f_{\sigma=s} - \tilde{f}_{\sigma=s}\|^2 = \max_{s,\sigma}\left\|\int_{\mathcal{X}}(f_x - \tilde{f}_x)1\{\sigma(x) = s\}\,dx\right\|^2$$

$$\le \int_{\mathcal{X}}\|f_x - \tilde{f}_x\|^2\,dx$$

$$\to 0 \text{ as } \min(Q,D)\to\infty.$$

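Since $\|F_\sigma - \tilde{F}_\sigma\|^2 = \sum_{k=1}^K\|f_{\sigma=k} - \tilde{f}_{\sigma=k}\|^2 + \|\pi_\sigma - \pi_\sigma\|^2$, where the last term vanishes, it follows that for any $\epsilon > 0$, there exists $(Q, D)$ inducing $\tilde{\mathcal{F}} = \{\tilde{F}_\sigma : \sigma \in [K]^{\mathcal{X}}\}$ such that

$$\sup_\sigma\|F_\sigma - \tilde{F}_\sigma\| \le \epsilon, \qquad (25)$$

so that the support functions of $\mathcal{F}$ and $\tilde{\mathcal{F}}$ can be bounded by

$$\sup_{H:\|H\|=1}\left|\Gamma_{\mathcal{F}}(H) - \Gamma_{\tilde{\mathcal{F}}}(H)\right| \le \max_{\|H\|=1,\,\sigma}\left|\langle H, F_\sigma - \tilde{F}_\sigma\rangle\right| \le \max_\sigma\|F_\sigma - \tilde{F}_\sigma\| \le \epsilon,$$

implying that

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{F}), \mathrm{conv}(\tilde{\mathcal{F}})) \le \epsilon, \qquad (26)$$

which in turn implies that for any $F^* \in \mathrm{conv}(\mathcal{F})$, there exists $\tilde{F}^* \in \mathrm{conv}(\tilde{\mathcal{F}})$ such that $\|F^* - \tilde{F}^*\| \le \epsilon$.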

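For any choice of $(Q, D)$, we observe that properties P1 and P2 as described in Lemma 5 for $\mathcal{G}$ also hold for $\tilde{\mathcal{F}}$:

P1: Each $\tilde{F} \in \mathrm{conv}(\tilde{\mathcal{F}})$ is a finite convex combination of elements in $\tilde{\mathcal{F}}$. This holds because each $\tilde{f}_x$ can be written as

$$\tilde{f}_x = \sum_{q=1}^Q\lambda_q\tilde{u}_q(x)v_q,$$

showing that $\{\tilde{f}_x : x \in \mathcal{X}\}$ lies in a finite dimensional subspace of the functions $\mathcal{Y} \to \mathbb{R}$, and hence $\tilde{\mathcal{F}}$ does as well, allowing Theorem 3 to be applied.

P2: For all $\epsilon$, there exists a finite $\epsilon$-cover of $\{\tilde{f}_x : x \in \mathcal{X}\}$ in Euclidean norm. This holds because the set $\{\tilde{u}(x) : x \in \mathcal{X}\}$, where $\tilde{u}(x) = (\tilde{u}_1(x),\ldots,\tilde{u}_Q(x))$, is a subset of the hypercube $[-D, D]^Q$.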
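As a result, the same arguments used to prove Lemma 5 also apply to $\tilde{\mathcal{F}}$, implying that for each $\tilde{F} \in \mathrm{conv}(\tilde{\mathcal{F}})$, there exists for any $\epsilon > 0$ a mapping $\sigma : \mathcal{X} \to [K]$ such that

$$\|\tilde{F}_\sigma - \tilde{F}\|^2 \le 4K\epsilon^2. \qquad (27)$$

It thus follows that for any $\epsilon > 0$ and $F^* \in \mathrm{conv}(\mathcal{F})$, there exist $\tilde{F}^* \in \mathrm{conv}(\tilde{\mathcal{F}})$ and $\sigma : \mathcal{X} \to [K]$ such that

$$\|F^* - F_\sigma\| \le \underbrace{\|F^* - \tilde{F}^*\|}_{\le\,\epsilon\text{ by (26)}} + \underbrace{\|\tilde{F}^* - \tilde{F}_\sigma\|}_{\le\,2K^{1/2}\epsilon\text{ by (27)}} + \underbrace{\|\tilde{F}_\sigma - F_\sigma\|}_{\le\,\epsilon\text{ by (25)}} \le 2\epsilon + 2K^{1/2}\epsilon.$$

As a result, it follows that there exist $F_1, F_2, \ldots \in \mathcal{F}$ such that $\lim_{i\to\infty}\|F^* - F_i\| = 0$. □

Proof of Lemma 3. Lemma 3 follows immediately from Lemmas 5 and 6, which establish (16) and (17) respectively. □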

A Proof of Lemma 4 

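To prove Lemma 4, we will use a result from [4], which we state and prove here:

Lemma 7. [4, Lemma 4.3] Let $\mathbb{H}$ denote a Hilbert space, and let $g : \mathcal{Y} \to \mathbb{H}$. Let $y_1,\ldots,y_n \in \mathcal{Y}$ be i.i.d., and let $L_n : \mathbb{H}^K \to \mathbb{R}$ be defined as follows:

$$L_n(H) = \frac{1}{n}\sum_{j=1}^n\max_{t\in[K]}\langle h_t, g(y_j)\rangle, \qquad H = (h_1,\ldots,h_K) \in \mathbb{H}^K.$$

Let $B = \{H \in \mathbb{H}^K : \|h_k\| \le 1, k \in [K]\}$. Then the following three statements hold:

$$E\sup_{H\in B}\left[L_n(H) - E L_n(H)\right] \le 2E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n\epsilon_j\max_{t\in[K]}\langle h_t, g(y_j)\rangle, \qquad (28)$$

where $\epsilon_1,\ldots,\epsilon_n$ are i.i.d. and equal $\pm 1$ with probability 1/2,

$$E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n\epsilon_j\max_{t\in[K]}\langle h_t, g(y_j)\rangle \le 2K\,E\sup_{\|h\|=1}\frac{1}{n}\sum_{j=1}^n\epsilon_j\langle h, g(y_j)\rangle, \qquad (29)$$

$$E\sup_{\|h\|=1}\frac{1}{n}\sum_{j=1}^n\epsilon_j\langle h, g(y_j)\rangle \le \left(\frac{E\|g(y)\|^2}{n}\right)^{1/2}. \qquad (30)$$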
Proof of Lemma 7. (28) is a standard symmetrization argument [9]. Letting $y'_1,\ldots,y'_n$ denote i.i.d. Uniform$[0,1]$ random variables, and $\epsilon_1,\ldots,\epsilon_n \sim \pm 1$ w.p. 1/2, it holds that

$$E\sup_{H\in B}\left[L_n(H) - E L_n(H)\right] \le E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n\left[\max_{t\in[K]}\langle h_t, g(y_j)\rangle - \max_{t\in[K]}\langle h_t, g(y'_j)\rangle\right]$$

$$= E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n\epsilon_j\left[\max_{t\in[K]}\langle h_t, g(y_j)\rangle - \max_{t\in[K]}\langle h_t, g(y'_j)\rangle\right]$$

$$\le E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n\epsilon_j\max_{t\in[K]}\langle h_t, g(y_j)\rangle + E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n(-\epsilon_j)\max_{t\in[K]}\langle h_t, g(y'_j)\rangle$$

$$= 2E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n\epsilon_j\max_{t\in[K]}\langle h_t, g(y_j)\rangle.$$
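To show (29), let $\mathcal{R}(\mathcal{F})$ denote the (non-absolute valued) Rademacher complexity of a function class $\mathcal{F}$:

$$\mathcal{R}(\mathcal{F}) = E\sup_{f\in\mathcal{F}}\frac{1}{n}\sum_{j=1}^n\epsilon_j f(y_j).$$

The following contraction principles for Rademacher complexity hold [4, 22]:

1. $\mathcal{R}(|\mathcal{F}|) \le \mathcal{R}(\mathcal{F})$, where $|\mathcal{F}| = \{|f| : f \in \mathcal{F}\}$. [8, Thm 11.6]

2. $\mathcal{R}(\mathcal{F}_1\oplus\mathcal{F}_2) \le \mathcal{R}(\mathcal{F}_1) + \mathcal{R}(\mathcal{F}_2)$, where $\mathcal{F}_1\oplus\mathcal{F}_2 = \{f_1 + f_2 : (f_1, f_2) \in \mathcal{F}_1\times\mathcal{F}_2\}$.

For $K = 2$, (29) follows from the following steps,

$$E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n\epsilon_j\max_{t\in[2]}\langle h_t, g(y_j)\rangle = \frac{1}{2}E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n\epsilon_j\Big(\langle h_1, g(y_j)\rangle + \langle h_2, g(y_j)\rangle + \left|\langle h_1, g(y_j)\rangle - \langle h_2, g(y_j)\rangle\right|\Big)$$

$$\le E\left[\sup_{\|h_1\|=1}\frac{1}{n}\sum_{j=1}^n\epsilon_j\langle h_1, g(y_j)\rangle + \sup_{\|h_2\|=1}\frac{1}{n}\sum_{j=1}^n\epsilon_j\langle h_2, g(y_j)\rangle\right]$$

$$= K\,E\sup_{\|h\|=1}\frac{1}{n}\sum_{j=1}^n\epsilon_j\langle h, g(y_j)\rangle,$$

which holds by $\max(a,b) = (a + b + |a - b|)/2$ and the contraction principles. The induction rule for general $K$ is straightforward, using the fact that $\max(a_1,\ldots,a_K) = \max(\max(a_1,\ldots,a_{K-1}), a_K)$.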
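The two elementary facts used in this step can be checked numerically. The following sketch is an illustration only, with hypothetical function names `max2` and `max_k`: it computes the maximum of two numbers through the identity $\max(a,b) = (a + b + |a - b|)/2$ and extends it to $K$ numbers by the nested rule $\max(a_1,\ldots,a_K) = \max(\max(a_1,\ldots,a_{K-1}), a_K)$.

```python
from functools import reduce


def max2(a, b):
    """Maximum of two numbers via (a + b + |a - b|) / 2."""
    return (a + b + abs(a - b)) / 2


def max_k(values):
    """Maximum of K numbers by folding max2 from left to right."""
    return reduce(max2, values)


if __name__ == "__main__":
    values = [0.3, -1.2, 2.5, 0.0]
    # Agrees with the built-in max up to floating point rounding.
    print(max_k(values), max(values))
```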
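To show (30), observe that

$$E\sup_{\|h\|=1}\frac{1}{n}\sum_{j=1}^n\epsilon_j\langle h, g(y_j)\rangle = E\sup_{\|h\|=1}\left\langle h, \frac{1}{n}\sum_{j=1}^n\epsilon_j g(y_j)\right\rangle$$

$$= E\left\|\frac{1}{n}\sum_{j=1}^n\epsilon_j g(y_j)\right\|$$

$$\le \left(E\left\|\frac{1}{n}\sum_{j=1}^n\epsilon_j g(y_j)\right\|^2\right)^{1/2}$$

$$= \left(\frac{1}{n}E\|g(y_1)\|^2\right)^{1/2}.$$

□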
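The Jensen step behind (30) can be verified exactly on a small example by enumerating all sign patterns. The sketch below is an illustration under stated assumptions: the vectors stand in for fixed realizations of $g(y_1),\ldots,g(y_n)$ in a finite dimensional space, so it checks the bound conditionally on the sample rather than in expectation over $y$, and the function names are hypothetical.

```python
import itertools
import math


def expected_norm_of_signed_mean(vectors):
    """Exact E|| (1/n) sum_j eps_j g_j || over all 2^n sign patterns."""
    n = len(vectors)
    dim = len(vectors[0])
    total = 0.0
    for signs in itertools.product((-1, 1), repeat=n):
        mean = [sum(s * v[i] for s, v in zip(signs, vectors)) / n
                for i in range(dim)]
        total += math.sqrt(sum(c * c for c in mean))
    return total / 2 ** n


def jensen_bound(vectors):
    """Upper bound ((1/n^2) * sum_j ||g_j||^2)^(1/2) obtained from Jensen."""
    n = len(vectors)
    return math.sqrt(sum(sum(c * c for c in v) for v in vectors)) / n


if __name__ == "__main__":
    vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.2]]
    print(expected_norm_of_signed_mean(vectors), jensen_bound(vectors))
```

The cross terms $\epsilon_j\epsilon_{j'}\langle g_j, g_{j'}\rangle$ average to zero over the sign patterns, which is why the bound depends only on the squared norms.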


Proof of Lemma 4. (28) - (30) imply that

$$E\sup_{H\in B}\left[L_n(H) - E L_n(H)\right] \le 4K\left(\frac{E\|g(y)\|^2}{n}\right)^{1/2}. \qquad (31)$$

It also holds that

$$E\inf_{H\in B}\left[L_n(H) - E L_n(H)\right] \ge 2E\inf_{H\in B}\frac{1}{n}\sum_{j=1}^n\epsilon_j\max_{t\in[K]}\langle h_t, g(y_j)\rangle$$

$$= -2E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n(-\epsilon_j)\max_{t\in[K]}\langle h_t, g(y_j)\rangle$$

$$= -2E\sup_{H\in B}\frac{1}{n}\sum_{j=1}^n\epsilon_j\max_{t\in[K]}\langle h_t, g(y_j)\rangle$$

$$\ge -4K\left(\frac{E\|g(y)\|^2}{n}\right)^{1/2}, \qquad (32)$$

where the first step holds by a symmetrization analogous to (28); the second by algebraic manipulation; the third because $\epsilon_1,\ldots,\epsilon_n$ are $\pm 1$ with probability 1/2; the fourth by (29) and (30).

Combining (31) and (32) proves the lemma. □


B Proof of Theorem 2 

Preliminaries 

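Let $\mathbb{D} = \{c \in [0,1)^d : \|c\| \le 1\}$, $\mathbb{S} = [K]\times\mathbb{D}$, $\mathbb{T} = [K]\times\mathbb{D}$, and $\Theta = [0,1]^{K\times K}$. Let $\tilde{\mathbb{D}}$ denote the smallest $\epsilon$-cover in 2-norm of $\mathbb{D}$. Let $\tilde{\mathbb{S}} = [K]\times\tilde{\mathbb{D}}$ and let $\tilde{\mathbb{T}} = [K]\times\tilde{\mathbb{D}}$. Let $\tilde{K} = |\tilde{\mathbb{S}}| = |\tilde{\mathbb{T}}| \le K(\sqrt{d}\,\epsilon^{-1})^d$.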
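To see where a cardinality bound of the form $K(\sqrt{d}\,\epsilon^{-1})^d$ can come from, the following sketch builds a simple axis-aligned grid over the cube $[0,1)^d$, which contains the set being covered. This is an illustrative construction and not necessarily the smallest cover; all function names are hypothetical. Rounding each coordinate to the nearest grid center moves it by at most $\epsilon/\sqrt{d}$, so the Euclidean distance to the nearest grid point is at most $\epsilon$.

```python
import math


def grid_cover_axis(eps, d):
    """Grid centers along one axis with spacing 2*eps/sqrt(d)."""
    step = 2.0 * eps / math.sqrt(d)
    count = math.ceil(1.0 / step)
    return [(i + 0.5) * step for i in range(count)]


def nearest_center(point, eps):
    """Round each coordinate of a point in [0,1)^d to its nearest grid center."""
    axis = grid_cover_axis(eps, len(point))
    return [min(axis, key=lambda a: abs(a - x)) for x in point]


def cover_size(eps, d, K):
    """Number of (label, grid point) pairs, with labels in {1, ..., K}."""
    return K * len(grid_cover_axis(eps, d)) ** d


if __name__ == "__main__":
    eps, d, K = 0.1, 3, 2
    point = [0.3, 0.6, 0.2]
    center = nearest_center(point, eps)
    print(center, math.dist(point, center), cover_size(eps, d, K))
```

Whenever $\sqrt{d}/\epsilon \ge 1$, the per-axis count $\lceil\sqrt{d}/(2\epsilon)\rceil$ is at most $\sqrt{d}/\epsilon$, which gives the stated form of the bound for this grid.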
As described in Section 4.3, recall that we may write $S$, $T$, $\sigma$ and $\tau$ as $S = (U, B)$, $T = (V, D)$, $\sigma = (\mu, \beta)$, and $\tau = (\nu, \delta)$. Given $S = (U, B)$, let $\tilde{S}$ denote its closest approximation in $\tilde{\mathbb{S}}^m$. This means that $\tilde{S} = (U, \tilde{B})$, with $\tilde{B} \in \tilde{\mathbb{D}}^m$ satisfying $\tilde{B}_i = \arg\min_{c\in\tilde{\mathbb{D}}}\|B_i - c\|$ for $i \in [m]$. Similarly, given $T = (V, D)$ or $\tau = (\nu, \delta)$, let $\tilde{T} = (V, \tilde{D})$ or $\tilde{\tau} = (\nu, \tilde{\delta})$ be defined analogously.

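Let $Z \in [0,1]^{m\times n}$ be defined by

$$Z_{ij} = B_i^T D_j W_{ij},$$

and let $\Phi_Z(U, V)$ be defined by

$$[\Phi_Z(U,V)]_{uv} = \frac{1}{nm}\sum_{i=1}^m\sum_{j=1}^n Z_{ij}1\{U_i = u, V_j = v\}.$$

Let $\Phi_\zeta(U,\nu)$ and $\Phi_\zeta(\mu,\nu)$ denote population versions of $\Phi_Z$, defined by

$$[\Phi_\zeta(U,\nu)]_{uv} = \frac{1}{m}\sum_{i=1}^m\int_{\mathcal{Y}}B_i^T\delta(y)\,\omega(x_i,y)\,1\{U_i = u, \nu(y) = v\}\,dy,$$

$$[\Phi_\zeta(\mu,\nu)]_{uv} = \int_{\mathcal{X}\times\mathcal{Y}}\beta(x)^T\delta(y)\,\omega(x,y)\,1\{\mu(x) = u, \nu(y) = v\}\,dx\,dy.$$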
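Let $\pi^B_{U=k}$, $\pi^D_{V=k}$, $\pi^\beta_{\mu=k}$, and $\pi^\delta_{\nu=k}$ be defined for $k \in [K]$ as

$$\pi^B_{U=k} = \frac{1}{m}\sum_{i=1}^m B_iB_i^T 1\{U_i = k\}, \qquad \pi^D_{V=k} = \frac{1}{n}\sum_{j=1}^n D_jD_j^T 1\{V_j = k\},$$

$$\pi^\beta_{\mu=k} = \int_{\mathcal{X}}\beta(x)\beta(x)^T 1\{\mu(x) = k\}\,dx, \qquad \pi^\delta_{\nu=k} = \int_{\mathcal{Y}}\delta(y)\delta(y)^T 1\{\nu(y) = k\}\,dy.$$

We observe that $\sum_{k=1}^K\|\pi^B_{U=k}\|_F \le 1$, since by triangle inequality,

$$\sum_{k=1}^K\|\pi^B_{U=k}\|_F \le \frac{1}{m}\sum_{i=1}^m\|B_iB_i^T\|_F \le 1,$$

where we have used $\|B_i\| \le 1$ for all $B_i \in \mathbb{D}$.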

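Recall the definitions for $g_y \in [0,1]^m$ and $f_x : \mathcal{Y} \to [0,1]$:

$$g_y(i) = \omega(x_i, y) \qquad\text{and}\qquad f_x(y) = \omega(x,y).$$

Define for $k \in [K]$ the matrices $g^D_{V=k}$ and $g^\delta_{\nu=k}$ in $[0,1]^{m\times d}$, and the functions $f^B_{U=k}$ and $f^\beta_{\mu=k}$ mapping $\mathcal{Y} \to \mathbb{D}$:

$$g^D_{V=k} = \frac{1}{n}\sum_{j=1}^n g_{y_j}D_j^T 1\{V_j = k\}, \qquad g^\delta_{\nu=k} = \int_{\mathcal{Y}}g_y\,\delta(y)^T 1\{\nu(y) = k\}\,dy,$$

$$f^B_{U=k} = \frac{1}{m}\sum_{i=1}^m f_{x_i}B_i 1\{U_i = k\}, \qquad f^\beta_{\mu=k} = \int_{\mathcal{X}}f_x\,\beta(x)1\{\mu(x) = k\}\,dx.$$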
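Define the matrix $1^B_{U=u} \in [0,1]^{m\times d}$ and function $1^\delta_{\nu=v} : \mathcal{Y} \to \mathbb{D}$ by

$$1^B_{U=u}(i,j) = \begin{cases}B_i(j) & \text{if } U_i = u \\ 0 & \text{otherwise}\end{cases} \qquad 1^\delta_{\nu=v}(y) = \begin{cases}\delta(y) & \text{if } \nu(y) = v \\ 0 & \text{otherwise.}\end{cases}$$

We observe that $m^{-1}\sum_{k=1}^K\|1^B_{U=k}\|^2 \le 1$ since $\|B_i\|^2 \le 1$ for all $B_i \in \mathbb{D}$. Analogous to $G_T$, $G_\tau$, $F_S$, $F_\sigma$ as defined in Section 5.1, let $G^D_V$, $G^\delta_\nu$, $F^B_U$, and $F^\beta_\mu$ be defined by:

$$G^D_V = \left(\frac{g^D_{V=1}}{\sqrt{m}},\ldots,\frac{g^D_{V=K}}{\sqrt{m}},\ \pi^D_{V=1},\ldots,\pi^D_{V=K},\ \frac{\Psi_T}{Kd}\right), \qquad G^\delta_\nu = \left(\frac{g^\delta_{\nu=1}}{\sqrt{m}},\ldots,\frac{g^\delta_{\nu=K}}{\sqrt{m}},\ \pi^\delta_{\nu=1},\ldots,\pi^\delta_{\nu=K},\ \frac{\Psi_\tau}{Kd}\right),$$

$$F^B_U = \left(f^B_{U=1},\ldots,f^B_{U=K},\ \pi^B_{U=1},\ldots,\pi^B_{U=K},\ \frac{\Psi_S}{Kd}\right), \qquad F^\beta_\mu = \left(f^\beta_{\mu=1},\ldots,f^\beta_{\mu=K},\ \pi^\beta_{\mu=1},\ldots,\pi^\beta_{\mu=K},\ \frac{\Psi_\sigma}{Kd}\right).$$

Define the sets $\mathcal{F}_m$, $\mathcal{G}_n$, $\mathcal{F}$, and $\mathcal{G}$ by

$$\mathcal{F}_m = \{F^{\tilde{B}}_U : \tilde{S} = (U,\tilde{B}) \in \tilde{\mathbb{S}}^m\}, \qquad \mathcal{F} = \{F^{\tilde{\beta}}_\mu : \tilde{\sigma} = (\mu,\tilde{\beta}) : \mathcal{X} \to \tilde{\mathbb{S}}\},$$

$$\mathcal{G}_n = \{G^{\tilde{D}}_V : \tilde{T} = (V,\tilde{D}) \in \tilde{\mathbb{T}}^n\}, \qquad \mathcal{G} = \{G^{\tilde{\delta}}_\nu : \tilde{\tau} = (\nu,\tilde{\delta}) : \mathcal{Y} \to \tilde{\mathbb{T}}\}.$$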


B.1 Intermediate Results for Proof of Theorem 2

Lemmas 8 and 9 are analogs to Lemmas 2 and 3. 


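Lemma 8. Under the conditions of Theorem 2,

$$\sup_{\|H\|=1}\left|\Gamma_{\mathcal{G}_n}(H) - \Gamma_{\mathcal{G}}(H)\right| \le O_P(\tilde{K}(\log n)n^{-1/2})$$

$$\sup_{\|H\|=1}\left|\Gamma_{\mathcal{F}_m}(H) - \Gamma_{\mathcal{F}}(H)\right| \le O_P(\tilde{K}(\log m)m^{-1/2}), \qquad (33)$$

which implies

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{G}_n), \mathrm{conv}(\mathcal{G})) \le O_P(\tilde{K}(\log n)n^{-1/2})$$

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{F}_m), \mathrm{conv}(\mathcal{F})) \le O_P(\tilde{K}(\log m)m^{-1/2}). \qquad (34)$$

Lemma 9. It holds that

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{G}), \mathcal{G}) = 0 \qquad (35)$$

$$d_{\mathrm{Haus}}(\mathrm{conv}(\mathcal{F}), \mathcal{F}) = 0. \qquad (36)$$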


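Lemmas 10 - 12 bound various error terms that appear in the proof of Theorem 2. They bound the approximation error that arises when substituting $(\tilde{S}, \tilde{T})$ for $(S, T)$, and also the differences $|R_A(\tilde{S},\tilde{T};\theta) - R_W(\tilde{S},\tilde{T};\theta)|$, $|R_W(\tilde{S},\tilde{T};\theta) - R_\omega(\tilde{S},\tilde{\tau};\theta)|$ and $|R_\omega(\tilde{S},\tilde{\tau};\theta) - R_\omega(\tilde{\sigma},\tilde{\tau};\theta)|$.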
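Lemma 10. It holds that

$$|R_A(S,T;\theta) - R_A(\tilde{S},\tilde{T};\theta)| \le 12\epsilon \qquad (37)$$

$$|R_\omega(S,\tau;\theta) - R_\omega(\tilde{S},\tilde{\tau};\theta)| \le 12\epsilon$$

$$|R_\omega(\sigma,\tau;\theta) - R_\omega(\tilde{\sigma},\tilde{\tau};\theta)| \le 12\epsilon,$$

and that

$$\|\Psi_S - \Psi_{\tilde{S}}\|^2 \le Kd\epsilon, \qquad \|\Psi_T - \Psi_{\tilde{T}}\|^2 \le Kd\epsilon, \qquad (38)$$

$$\|\Psi_\sigma - \Psi_{\tilde{\sigma}}\|^2 \le Kd\epsilon, \qquad \|\Psi_\tau - \Psi_{\tilde{\tau}}\|^2 \le Kd\epsilon.$$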

Lemma 11. If $\tilde{K} \le n^{1/2}$, it holds that

$$|R_A(\tilde{S},\tilde{T};\theta) - R_W(\tilde{S},\tilde{T};\theta) - C_1| \le 2\tilde{K}\,O_P(\tilde{K}(\log n)n^{-1}). \qquad (39)$$

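Lemma 12. Given $\tilde{T} = (V, \tilde{D}) \in \tilde{\mathbb{T}}^n$, let $\tilde{\tau} = (\nu, \tilde{\delta}) : \mathcal{Y} \to \tilde{\mathbb{T}}$ minimize $\|G^{\tilde{D}}_V - G^{\tilde{\delta}}_\nu\|$. It holds that

$$|R_W(\tilde{S},\tilde{T};\theta) - R_\omega(\tilde{S},\tilde{\tau};\theta) - C_2| \le O_P(K\tilde{K}(\log n)n^{-1/2}). \qquad (40)$$

Given $\tilde{S} = (U, \tilde{B}) \in \tilde{\mathbb{S}}^m$, let $\tilde{\sigma} = (\mu, \tilde{\beta}) : \mathcal{X} \to \tilde{\mathbb{S}}$ minimize $\|F^{\tilde{B}}_U - F^{\tilde{\beta}}_\mu\|$. It holds that

$$|R_\omega(\tilde{S},\tilde{\tau};\theta) - R_\omega(\tilde{\sigma},\tilde{\tau};\theta) - C_3| \le O_P(K\tilde{K}(\log m)m^{-1/2}). \qquad (41)$$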


B.2 Proof of Theorem 2 


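Proof of Theorem 2. Given $\tilde{T} = (V, \tilde{D})$, let $\tilde{\tau} = (\nu, \tilde{\delta})$ minimize $\|G^{\tilde{D}}_V - G^{\tilde{\delta}}_\nu\|$, which by Lemmas 8 and 9 is bounded by $O_P(\tilde{K}(\log n)n^{-1/2})$. Using this fact and (38), the quantity $\|\Psi_T - \Psi_{\tilde{\tau}}\|^2$ can be bounded by

$$\|\Psi_T - \Psi_{\tilde{\tau}}\|^2 \le 2\|\Psi_T - \Psi_{\tilde{T}}\|^2 + 2\|\Psi_{\tilde{T}} - \Psi_{\tilde{\tau}}\|^2$$

$$\le 2\|\Psi_T - \Psi_{\tilde{T}}\|^2 + 2\|G^{\tilde{D}}_V - G^{\tilde{\delta}}_\nu\|^2$$

$$\le 2Kd\epsilon + O_P(\tilde{K}^2(\log n)n^{-1}). \qquad (42)$$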


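Using (37), (39), (40), (41), and (42), it holds for $\tilde{K} \le n^{1/2}$ that

$$|R_A(S,T;\theta) - R_\omega(S,\tilde{\tau};\theta) - C_1 - C_2| + \frac{\|\Psi_T - \Psi_{\tilde{\tau}}\|^2}{Kd} \le |R_A(S,T;\theta) - R_A(\tilde{S},\tilde{T};\theta)|$$

$$+\ |R_A(\tilde{S},\tilde{T};\theta) - R_W(\tilde{S},\tilde{T};\theta) - C_1|$$

$$+\ |R_W(\tilde{S},\tilde{T};\theta) - R_\omega(\tilde{S},\tilde{\tau};\theta) - C_2|$$

$$+\ |R_\omega(\tilde{S},\tilde{\tau};\theta) - R_\omega(S,\tilde{\tau};\theta)| + \frac{\|\Psi_T - \Psi_{\tilde{\tau}}\|^2}{Kd}$$

$$\le 26\epsilon + O_P\left(\frac{\tilde{K}^2\log n}{n}\right) + O_P\left(\frac{K\tilde{K}\log n}{n^{1/2}}\right). \qquad (43)$$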
Using $\tilde{K} \le K(d^{1/2}\epsilon^{-1})^d$ and letting $\epsilon = d^{1/2}\left(\frac{K^2\log n}{n^{1/2}}\right)^{\frac{1}{1+d}}$ yields that $\tilde{K} \le n^{1/2}$ eventually, so that substituting into (43) yields

$$|R_A(S,T;\theta) - R_\omega(S,\tilde{\tau};\theta) - C_1 - C_2| + \frac{\|\Psi_T - \Psi_{\tilde{\tau}}\|^2}{Kd} \le O_P\left(d^{1/2}\left(\frac{K^2\log n}{n^{1/2}}\right)^{\frac{1}{1+d}}\right),$$

proving (5).

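Similarly, it holds that

$$|R_\omega(S,\tau;\theta) - R_\omega(\tilde{\sigma},\tau;\theta) - C_3| + \frac{\|\Psi_S - \Psi_{\tilde{\sigma}}\|^2}{Kd} \le |R_\omega(S,\tau;\theta) - R_\omega(\tilde{S},\tilde{\tau};\theta)|$$

$$+\ |R_\omega(\tilde{S},\tilde{\tau};\theta) - R_\omega(\tilde{\sigma},\tilde{\tau};\theta) - C_3|$$

$$+\ |R_\omega(\tilde{\sigma},\tilde{\tau};\theta) - R_\omega(\tilde{\sigma},\tau;\theta)| + \frac{\|\Psi_S - \Psi_{\tilde{\sigma}}\|^2}{Kd}$$

$$\le 26\epsilon + O_P\left(\frac{\tilde{K}^2\log(m)}{m}\right) + O_P\left(\frac{K\tilde{K}\log(m)}{m^{1/2}}\right),$$

and letting $\epsilon = \left(\frac{K^2d^{d/2}\log m}{m^{1/2}}\right)^{\frac{1}{1+d}}$ proves (6). □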


B.3 Proof of Lemmas 8-12 

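Proof of Lemma 8. Let $H = (h_1,\ldots,h_K,\pi_1,\ldots,\pi_K,\psi_H)$, where $h_k \in \mathbb{R}^{m\times d}$, $\pi_k \in \mathbb{R}^{d\times d}$, and $\psi_H : [K]\times\mathbb{D} \to [0,1]$. Given $(v,d) \in [K]\times\mathbb{D}$, let $1_{v,d} : [K]\times\mathbb{D} \to [0,1]$ denote the indicator function

$$1_{v,d}(k,c) = 1\{v \le k, d \le c\}.$$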

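Given $G^{\tilde{D}}_V \in \mathcal{G}_n$ and $G^{\tilde{\delta}}_\nu \in \mathcal{G}$, the inner products $\langle H, G^{\tilde{D}}_V\rangle$ and $\langle H, G^{\tilde{\delta}}_\nu\rangle$ equal

$$\langle H, G^{\tilde{D}}_V\rangle = \sum_{k=1}^K\left\langle h_k, \frac{g^{\tilde{D}}_{V=k}}{\sqrt{m}}\right\rangle + \sum_{k=1}^K\langle\pi_k, \pi^{\tilde{D}}_{V=k}\rangle + \frac{1}{Kd}\langle\psi_H, \Psi_{\tilde{T}}\rangle$$

$$= \frac{1}{n}\sum_{j=1}^n\left[\left\langle h_{V_j}, \frac{g_{y_j}\tilde{D}_j^T}{\sqrt{m}}\right\rangle + \langle\pi_{V_j}, \tilde{D}_j\tilde{D}_j^T\rangle + \frac{1}{Kd}\langle\psi_H, 1_{V_j,\tilde{D}_j}\rangle\right],$$

$$\langle H, G^{\tilde{\delta}}_\nu\rangle = \sum_{k=1}^K\left\langle h_k, \frac{g^{\tilde{\delta}}_{\nu=k}}{\sqrt{m}}\right\rangle + \sum_{k=1}^K\langle\pi_k, \pi^{\tilde{\delta}}_{\nu=k}\rangle + \frac{1}{Kd}\langle\psi_H, \Psi_{\tilde{\tau}}\rangle$$

$$= \int_{\mathcal{Y}}\left[\left\langle h_{\nu(y)}, \frac{g_y\tilde{\delta}(y)^T}{\sqrt{m}}\right\rangle + \langle\pi_{\nu(y)}, \tilde{\delta}(y)\tilde{\delta}(y)^T\rangle + \frac{1}{Kd}\langle\psi_H, 1_{\nu(y),\tilde{\delta}(y)}\rangle\right]dy.$$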
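It follows that the support functions $\Gamma_{\mathcal{G}_n}$ and $\Gamma_{\mathcal{G}}$ equal

$$\Gamma_{\mathcal{G}_n}(H) = \frac{1}{n}\sum_{j=1}^n\max_{v\in[K],\,c\in\tilde{\mathbb{D}}}\left[\left\langle h_v, \frac{g_{y_j}c^T}{\sqrt{m}}\right\rangle + \langle\pi_v, cc^T\rangle + \frac{1}{Kd}\langle\psi_H, 1_{v,c}\rangle\right]$$

$$= \frac{1}{n}\sum_{j=1}^n\max_{(v,c)\in\tilde{\mathbb{T}}}\left\langle\begin{bmatrix}h_vc\\ \langle\pi_v, cc^T\rangle\\ (Kd)^{-1}\langle\psi_H, 1_{v,c}\rangle\end{bmatrix}, \begin{bmatrix}m^{-1/2}g_{y_j}\\ 1\\ 1\end{bmatrix}\right\rangle,$$

$$\Gamma_{\mathcal{G}}(H) = \int_{\mathcal{Y}}\max_{v\in[K],\,c\in\tilde{\mathbb{D}}}\left[\left\langle h_v, \frac{g_yc^T}{\sqrt{m}}\right\rangle + \langle\pi_v, cc^T\rangle + \frac{1}{Kd}\langle\psi_H, 1_{v,c}\rangle\right]dy$$

$$= \int_{\mathcal{Y}}\max_{(v,c)\in\tilde{\mathbb{T}}}\left\langle\begin{bmatrix}h_vc\\ \langle\pi_v, cc^T\rangle\\ (Kd)^{-1}\langle\psi_H, 1_{v,c}\rangle\end{bmatrix}, \begin{bmatrix}m^{-1/2}g_y\\ 1\\ 1\end{bmatrix}\right\rangle dy.$$

Given $t = (v,c) \in \tilde{\mathbb{T}}$, let

$$h'_t = \begin{bmatrix}h_vc\\ \langle\pi_v, cc^T\rangle\\ (Kd)^{-1}\langle\psi_H, 1_{v,c}\rangle\end{bmatrix}.$$

Since $\|c\| \le 1$ and $\|(Kd)^{-1}1_{v,c}\| \le 1/d$, it follows that $\|h'_t\|^2 \le \|h_v\|_F^2 + \|\pi_v\|_F^2 + \|\psi_H\|^2/d^2$, where $\|\cdot\|_F$ denotes Frobenius norm, so that if $\|H\| \le 1$, then $\|h'_t\| \le 1$ for all $t \in \tilde{\mathbb{T}}$. As a result, the proof of Lemma 2 can be copied here: Lemma 4 implies that

$$E\sup_{\|H\|=1}\left|\Gamma_{\mathcal{G}}(H) - \Gamma_{\mathcal{G}_n}(H)\right| \le 4\tilde{K}\left(\frac{3}{n}\right)^{1/2},$$

and McDiarmid's inequality applied to $Z = \sup_{\|H\|=1}|\Gamma_{\mathcal{G}_n}(H) - \Gamma_{\mathcal{G}}(H)|$ implies that $Z - EZ = O_P(n^{-1/2}\log n)$.

The proof for $\sup_{\|H\|=1}|\Gamma_{\mathcal{F}_m}(H) - \Gamma_{\mathcal{F}}(H)|$ follows parallel arguments. □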


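Proof of Lemma 9. Enumerate the members of $\tilde{\mathbb{T}}$ as $1,\ldots,\tilde{K}$. Given $(v,c) \in \tilde{\mathbb{T}}$, let $t(v,c)$ denote its corresponding index in $1,\ldots,\tilde{K}$. Given $\tilde{T} = (V,\tilde{D}) \in \tilde{\mathbb{T}}^n$, recall the definition of

$$G_{\tilde{T}} = \left(\frac{g_{\tilde{T}=1}}{\sqrt{m}},\ldots,\frac{g_{\tilde{T}=\tilde{K}}}{\sqrt{m}},\ \pi_{\tilde{T}}\right),$$

a vector in $\mathbb{R}^{m\tilde{K}+\tilde{K}}$. It can be seen that

$$G^{\tilde{D}}_V = \left(\frac{g^{\tilde{D}}_{V=1}}{\sqrt{m}},\ldots,\frac{g^{\tilde{D}}_{V=K}}{\sqrt{m}},\ \pi^{\tilde{D}}_{V=1},\ldots,\pi^{\tilde{D}}_{V=K},\ \frac{\Psi_{\tilde{T}}}{Kd}\right)$$

is a linear transformation of $G_{\tilde{T}}$, given by

$$g^{\tilde{D}}_{V=k} = \sum_{c\in\tilde{\mathbb{D}}}g_{\tilde{T}=t(k,c)}\,c^T, \quad k \in [K]$$

$$\pi^{\tilde{D}}_{V=k} = \sum_{c\in\tilde{\mathbb{D}}}\pi_{\tilde{T}}(t(k,c))\,cc^T, \quad k \in [K]$$

$$\Psi_{\tilde{T}} = \sum_{k=1}^K\sum_{c\in\tilde{\mathbb{D}}}\pi_{\tilde{T}}(t(k,c))\,1_{k,c}.$$

By Lemma 3, it holds that $\{G_{\tilde{\tau}} : \tilde{\tau} : \mathcal{Y} \to \tilde{\mathbb{T}}\}$ is convex. Since $\mathcal{G}$ is related to this set by a linear transform, and as linear transformations preserve convexity, it follows that $\mathcal{G}$ is also convex. By parallel arguments, it also follows that $\mathcal{F}$ is a linear transformation of $\{F_{\tilde{\sigma}} : \tilde{\sigma} : \mathcal{X} \to \tilde{\mathbb{S}}\}$ and hence convex as well. □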

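Proof of Lemma 10. If $\|B_i - \tilde{B}_i\| \le \epsilon$ for all $i \in [m]$ and $\|D_j - \tilde{D}_j\| \le \epsilon$ for all $j \in [n]$, then

$$|R_A(S,T;\theta) - R_A(\tilde{S},\tilde{T};\theta)| \le \frac{1}{nm}\sum_{i=1}^m\sum_{j=1}^n\left|(A_{ij} - B_i^TD_j\theta_{U_iV_j})^2 - (A_{ij} - \tilde{B}_i^T\tilde{D}_j\theta_{U_iV_j})^2\right| \le 12\epsilon,$$

where we use the fact that $\|B_i\|$, $\|D_j\|$, $\theta_{uv}$, and $A_{ij}$ are all between 0 and 1. By similar arguments, it also holds that

$$|R_\omega(S,\tau;\theta) - R_\omega(\tilde{S},\tilde{\tau};\theta)| \le 12\epsilon$$

$$|R_\omega(\sigma,\tau;\theta) - R_\omega(\tilde{\sigma},\tilde{\tau};\theta)| \le 12\epsilon.$$

We also show that $\|\Psi_S - \Psi_{\tilde{S}}\|^2 \le Kd\epsilon$ by

$$\|\Psi_S - \Psi_{\tilde{S}}\|^2 = \sum_{k=1}^K\int_{[0,1)^d}\left[\frac{1}{m}\sum_{i=1}^m\left(1\{U_i \le k, B_i \le c\} - 1\{U_i \le k, \tilde{B}_i \le c\}\right)\right]^2dc$$

$$\le \sum_{k=1}^K\int_{[0,1)^d}\frac{1}{m}\sum_{i=1}^m\left[1\{U_i \le k, B_i \le c\} - 1\{U_i \le k, \tilde{B}_i \le c\}\right]^2dc$$

$$= \frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K\int_{[0,1)^d}\left|1\{U_i \le k, B_i \le c\} - 1\{U_i \le k, \tilde{B}_i \le c\}\right|dc$$

$$\le Kd\epsilon,$$

where the first inequality holds by Jensen's inequality, and the second inequality holds because $\|B_i - \tilde{B}_i\| \le \epsilon$ and the integral is over $[0,1)^d$. The quantities $\|\Psi_T - \Psi_{\tilde{T}}\|^2$, $\|\Psi_\sigma - \Psi_{\tilde{\sigma}}\|^2$, etc., are bounded similarly. □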


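Proof of Lemma 11. Given $\theta \in [0,1]^{K\times K}$, let $\tilde{\theta} \in [0,1]^{\tilde{K}\times\tilde{K}}$ be given by

$$\tilde{\theta}_{st} = b^Td\,\theta_{uv} \text{ for all } s = (u,b) \in \tilde{\mathbb{S}},\ t = (v,d) \in \tilde{\mathbb{T}}.$$

For $\tilde{S} = (U,\tilde{B}) \in \tilde{\mathbb{S}}^m$ and $\tilde{T} = (V,\tilde{D}) \in \tilde{\mathbb{T}}^n$,

$$R_A(\tilde{S},\tilde{T};\theta) - R_W(\tilde{S},\tilde{T};\theta) = C_1 - 2\sum_{s\in\tilde{\mathbb{S}}}\sum_{t\in\tilde{\mathbb{T}}}\left([\Phi_A(\tilde{S},\tilde{T})]_{st} - [\Phi_W(\tilde{S},\tilde{T})]_{st}\right)\tilde{\theta}_{st}, \qquad (44)$$

where $C_1$ is constant in $(\tilde{S},\tilde{T},\theta)$. This implies

$$|R_A(\tilde{S},\tilde{T};\theta) - R_W(\tilde{S},\tilde{T};\theta) - C_1| \le 2\|\Phi_A(\tilde{S},\tilde{T}) - \Phi_W(\tilde{S},\tilde{T})\|_1$$

$$\le 2\tilde{K}\|\Phi_A(\tilde{S},\tilde{T}) - \Phi_W(\tilde{S},\tilde{T})\|_2$$

$$\le 2\tilde{K}\,O_P(\tilde{K}(\log n)n^{-1}),$$

where the inequalities follow by (44), the equivalence of norms, and Lemma 1, which requires $\tilde{K} \le n^{1/2}$. □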


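Proof of Lemma 12. It holds that

$$R_W(\tilde{S},\tilde{T};\theta) - R_\omega(\tilde{S},\tilde{\tau};\theta) = C_2 - 2\sum_{u=1}^K\sum_{v=1}^K\left([\Phi_Z(U,V)]_{uv} - [\Phi_\zeta(U,\nu)]_{uv}\right)\theta_{uv}$$

$$+ \sum_{u=1}^K\sum_{v=1}^K\left(\langle\pi^{\tilde{B}}_{U=u}, \pi^{\tilde{D}}_{V=v}\rangle - \langle\pi^{\tilde{B}}_{U=u}, \pi^{\tilde{\delta}}_{\nu=v}\rangle\right)\theta_{uv}^2,$$

where $C_2$ is constant in $S$, $T$, $\theta$ and $\tau$. This implies

$$|R_W(\tilde{S},\tilde{T};\theta) - R_\omega(\tilde{S},\tilde{\tau};\theta) - C_2| \le 2\|\Phi_Z(U,V) - \Phi_\zeta(U,\nu)\|_1 + \sum_{u=1}^K\|\pi^{\tilde{B}}_{U=u}\|\sum_{v=1}^K\|\pi^{\tilde{D}}_{V=v} - \pi^{\tilde{\delta}}_{\nu=v}\|$$

$$\le 2K\|\Phi_Z(U,V) - \Phi_\zeta(U,\nu)\| + \sqrt{K}\left(\sum_{v=1}^K\|\pi^{\tilde{D}}_{V=v} - \pi^{\tilde{\delta}}_{\nu=v}\|^2\right)^{1/2}, \qquad (45)$$

where the final inequality uses the fact that $\sum_{u=1}^K\|\pi^{\tilde{B}}_{U=u}\| \le 1$.

It can be seen that the entries of $\Phi_Z(U,V)$ and $\Phi_\zeta(U,\nu)$ equal the inner products

$$[\Phi_Z(U,V)]_{uv} = \frac{1}{m}\left\langle 1^{\tilde{B}}_{U=u}, g^{\tilde{D}}_{V=v}\right\rangle \qquad\text{and}\qquad [\Phi_\zeta(U,\nu)]_{uv} = \frac{1}{m}\left\langle 1^{\tilde{B}}_{U=u}, g^{\tilde{\delta}}_{\nu=v}\right\rangle,$$

which implies

$$\|\Phi_Z(U,V) - \Phi_\zeta(U,\nu)\|^2 \le \left(\sum_{u=1}^K\frac{1}{m}\|1^{\tilde{B}}_{U=u}\|^2\right)\left(\sum_{v=1}^K\frac{1}{m}\|g^{\tilde{D}}_{V=v} - g^{\tilde{\delta}}_{\nu=v}\|^2\right)$$

$$\le \sum_{v=1}^K\frac{1}{m}\|g^{\tilde{D}}_{V=v} - g^{\tilde{\delta}}_{\nu=v}\|^2. \qquad (46)$$

Given $G^{\tilde{D}}_V \in \mathcal{G}_n$, let $G^{\tilde{\delta}}_\nu \in \mathcal{G}$ minimize $\|G^{\tilde{D}}_V - G^{\tilde{\delta}}_\nu\|$. Since both terms on the right of (45) are controlled by $\|G^{\tilde{D}}_V - G^{\tilde{\delta}}_\nu\|$, using (45), (46) and Lemma 8 implies

$$|R_W(\tilde{S},\tilde{T};\theta) - R_\omega(\tilde{S},\tilde{\tau};\theta) - C_2| \le 3K\|G^{\tilde{D}}_V - G^{\tilde{\delta}}_\nu\|$$

$$\le O_P(K\tilde{K}(\log n)n^{-1/2}).$$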

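Similarly, it holds that

$$R_\omega(\tilde{S},\tilde{\tau};\theta) - R_\omega(\tilde{\sigma},\tilde{\tau};\theta) = C_3 - 2\sum_{u=1}^K\sum_{v=1}^K\left([\Phi_\zeta(U,\nu)]_{uv} - [\Phi_\zeta(\mu,\nu)]_{uv}\right)\theta_{uv}$$

$$+ \sum_{u=1}^K\sum_{v=1}^K\left(\langle\pi^{\tilde{B}}_{U=u}, \pi^{\tilde{\delta}}_{\nu=v}\rangle - \langle\pi^{\tilde{\beta}}_{\mu=u}, \pi^{\tilde{\delta}}_{\nu=v}\rangle\right)\theta_{uv}^2,$$

where $C_3$ is constant in $S$, $\tau$, $\sigma$ and $\theta$. This implies

$$|R_\omega(\tilde{S},\tilde{\tau};\theta) - R_\omega(\tilde{\sigma},\tilde{\tau};\theta) - C_3| \le 2\|\Phi_\zeta(U,\nu) - \Phi_\zeta(\mu,\nu)\|_1 + \sum_{v=1}^K\|\pi^{\tilde{\delta}}_{\nu=v}\|\sum_{u=1}^K\|\pi^{\tilde{B}}_{U=u} - \pi^{\tilde{\beta}}_{\mu=u}\|$$

$$\le 2K\|\Phi_\zeta(U,\nu) - \Phi_\zeta(\mu,\nu)\| + \sqrt{K}\left(\sum_{u=1}^K\|\pi^{\tilde{B}}_{U=u} - \pi^{\tilde{\beta}}_{\mu=u}\|^2\right)^{1/2}, \qquad (47)$$

where we have used $\sum_v\|\pi^{\tilde{\delta}}_{\nu=v}\| \le 1$.

The entries of $\Phi_\zeta(U,\nu)$ and $\Phi_\zeta(\mu,\nu)$ equal the inner products

$$[\Phi_\zeta(U,\nu)]_{uv} = \left\langle f^{\tilde{B}}_{U=u}, 1^{\tilde{\delta}}_{\nu=v}\right\rangle \qquad\text{and}\qquad [\Phi_\zeta(\mu,\nu)]_{uv} = \left\langle f^{\tilde{\beta}}_{\mu=u}, 1^{\tilde{\delta}}_{\nu=v}\right\rangle,$$

which implies

$$\|\Phi_\zeta(U,\nu) - \Phi_\zeta(\mu,\nu)\|^2 \le \left(\sum_{u=1}^K\|f^{\tilde{B}}_{U=u} - f^{\tilde{\beta}}_{\mu=u}\|^2\right)\left(\sum_{v=1}^K\|1^{\tilde{\delta}}_{\nu=v}\|^2\right)$$

$$\le \sum_{u=1}^K\|f^{\tilde{B}}_{U=u} - f^{\tilde{\beta}}_{\mu=u}\|^2. \qquad (48)$$

Given $F^{\tilde{B}}_U \in \mathcal{F}_m$, let $F^{\tilde{\beta}}_\mu \in \mathcal{F}$ minimize $\|F^{\tilde{B}}_U - F^{\tilde{\beta}}_\mu\|$. It follows from (47), (48), and Lemma 8 that

$$|R_\omega(\tilde{S},\tilde{\tau};\theta) - R_\omega(\tilde{\sigma},\tilde{\tau};\theta) - C_3| \le 3K\|F^{\tilde{B}}_U - F^{\tilde{\beta}}_\mu\|$$

$$\le O_P(K\tilde{K}(\log m)m^{-1/2}).$$

□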